Example use of the output of the NFS Data-Documentation-Procedure

Author

Roberto Scotti NFS-DipAGR-UniSS.it

Published

January 13, 2023

Intro

NuoroForestrySchool (a study center of the Department of Agriculture, University of Sassari, Italy) has developed and published a ‘data documentation procedure’ (link to NFS-DDP) that enables data collectors to improve the FAIRness of any dataset they wish to share as open data.

Datasets are very frequently shared as spreadsheet files. While spreadsheets are very handy when preparing and starting to analyse the data, they are not a good structure for archiving consolidated data, and they are a particularly weak means of exchange unless the data structures are extremely simple.

NFS-DDP takes as input a spreadsheet in which the data have been organized as relational tables, one per sheet, with 4 additional sheets used as standard metadata tables. The procedure outputs an SQLite relational database (including data and metadata) and a pdf-file documenting the database structure and contents.
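
As a rough sketch of the expected input layout (the file name below is purely illustrative, not the actual NFS-DDP template), such a workbook could be inspected in R before submitting it to the procedure:

library(readxl)
# hypothetical input workbook: one sheet per relational table,
# plus the standard metadata sheets required by the procedure
excel_sheets("my_experiment_data.xlsx")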

In the PANGAEA repository, Giadrossich et al. (2022) have shared the data concerning a soil erosion experiment carried out in the forest with artificial rain. The zip-archive that can be downloaded contains the experiment data and metadata processed by NFS-DDP. In the following, this output is used as an example and exploited to produce an essential summary table.

suppressPackageStartupMessages(library(tidyverse))

Retrieving data shared through PANGAEA

q_string <- "Giadrossich 'runoff and soil erosion with and without forest'"
# a string that identifies the dataset to select; it can be the DOI
library(pangaear)
Registered S3 method overwritten by 'hoardr':
  method           from
  print.cache_info httr
(doi <- 
    pg_search(query = q_string)|> 
    pluck("doi"))
[1] "10.1594/PANGAEA.943451"
pg_cache$delete_all()
res <- pg_data(doi = doi)
Downloading 1 datasets from 10.1594/PANGAEA.943451
Processing 1 files
(files <- pg_cache$list() |> 
  unzip(list = T))
                                                                                Name
1              Giadrossich-etal_2022/2018_SoilErosionTrials_Marganai_Sardegna.sqlite
2 Giadrossich-etal_2022/2018_SoilErosionTrials_Marganai_Sardegna_DBdocumentation.pdf
  Length                Date
1 114688 2022-04-22 09:28:00
2 390027 2022-04-22 09:28:00
grep(".zip", pg_cache$list(), value = T) |>
  unzip(exdir = pg_cache$cache_path_get())

Display database ER schema

The zip-archive contains, besides the SQLite database, a pdf-file documenting in readable form how the data have been generated and the structure of the relational tables.

Reading and understanding the documentation is at least helpful, if not necessary, before proceeding to the processing. In particular, the ER schema contained in the documentation represents the basic ‘map’ for the development of the processing code.

The following block extracts the ER diagram from the pdf-file and displays it in a viewer window. After taking a look, close the window to proceed.

library(magick)
Linking to ImageMagick 6.9.10.23
Enabled features: fontconfig, freetype, fftw, lcms, pango, webp, x11
Disabled features: cairo, ghostscript, heic, raw, rsvg
Using 4 threads
pdf <- grep(".pdf", files$Name, value = T)
system(paste("pdfimages -all", 
             paste0(pg_cache$cache_path_get(), "/", pdf),  
             paste0(pg_cache$cache_path_get(), "/img")))
list.files(pg_cache$cache_path_get())
[1] "10_1594_PANGAEA_943451.zip" "Giadrossich-etal_2022"     
[3] "img-000.png"                "img-001.png"               
file.path(pg_cache$cache_path_get(), "img-000.png") |> 
  image_read() |> 
  image_display()

#  image_browse()
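
If the poppler command-line tool pdfimages is not available on the system, a page-level alternative keeps everything within R. This is a hedged sketch, assuming the ‘pdftools’ package is installed and that the ER diagram sits on the page being rendered:

# alternative sketch: render a page of the documentation pdf to PNG, then display it
library(pdftools)
pdf_path <- file.path(pg_cache$cache_path_get(), pdf)
png_page <- pdf_convert(pdf_path, format = "png", pages = 1, dpi = 150)
image_read(png_page) |> image_display()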

Access to the SQLite database

When the shared archive was retrieved and decompressed (unzipped), the extracted files were saved in the default folder defined by the ‘pangaear’ R-package. To access the database, package functions are used to build the path.

library(DBI)
db_file <- file.path(pg_cache$cache_path_get(),
                     grep(".sqlite", files$Name, value = T))
con <- dbConnect(RSQLite::SQLite(), dbname = db_file)
dbListTables(con)
 [1] "Bloks"                  "Nozzle_pressure"        "Plots"                 
 [4] "RainGauge"              "RainGaugeReferenceArea" "Runoff"                
 [7] "Sediment"               "Trials"                 "VWC"                   
[10] "meta.DDICT"             "meta.REFERENCES"        "meta.TABLES"           

Reading data tables

The code blocks above are quite generic: by modifying the initial query string so as to retrieve a different NFS-DDP output archive shared on PANGAEA, the processing will reach this same point.
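
For instance, assuming the DOI of the target archive is known, it can be used directly as the query string (as noted in the comment of the first code block); the DOI below is that of the present example:

# hypothetical variant: select the dataset by DOI instead of free-text search
q_string <- "10.1594/PANGAEA.943451"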

From here on code is specifically tailored to produce the intended processing on this particular dataset.

The aim here is to produce a table of summary statistics for the main measurements, hence only a limited set of tables needs to be read and imported.

for(t in c("RainGaugeReferenceArea", "Sediment", "RainGauge", "Runoff"))
  assign(t, dbReadTable(con, t))

Extracting basic statistics

In a first step, the different measurements are preprocessed and collected into a single ‘vertical’ (long-format) table containing the elementary data.

Preprocessing

The data were registered in raw form. Some filtering is required, and specific aggregation procedures have to be applied depending on data granularity.

RAINFALL

Some rain-gauge measures were taken externally, as controls, and hence have to be excluded (filter(Represented_area_mm2 > 0)).

The values of the three rain-gauges (L, R and U) are averaged (neglecting the possible weight Represented_area_mm2), directly providing a plot-level measure in [mm/h].

RUNOFF

For one of the plots (Block == 3, Cover == 'W', Trial == 2) the team ran out of collectors, so the last 10 minutes are N/A. The measure has been completed using the last available measure as a reference.

Run-off measures are registered as grams collected at 5-minute intervals. To produce the plot measure, the values have to be cumulated over the 30 minutes of the trial and converted to [mm/h]; g2mm is the conversion factor.
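
As a quick sanity check of the conversion defined in the code block below (Duration = 30 min, plot_surface = 1.5 m^2): 1 g of water is 1000 mm^3, which spread over the 1.5 × 10^6 mm^2 plot gives a water depth of 1000 / 1.5e6 mm collected in half an hour; doubling to an hourly rate yields about 0.00133 mm/h per gram.

# quick check of the runoff conversion factor (same formula as g2mm below)
(1000 * 60/30) / (1.5 * 1000^2)  # ~ 0.001333 mm/h per gram collected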

SOLID MATTER

The two measures (total and organic fraction) are registered as plot level values.

In mean_sedimet_t_ha_h there is a ×2 factor because the values are in t/ha/year, i.e. the estimated annual erosion: at the study site, on average, two rainfall events of that intensity occur per year (see Vacca et al.). NOTE: in the final table, however, the value reported refers only to the rainfall event of the trial itself, hence without the ×2. I DO NOT UNDERSTAND: if it is not in the table, where is that sediment value with the ×2?

# Organizing data as (Trial, Cover, Block) measures for each metric

n_r_ev <- 2 # average yearly rainfall events with this intensity
Duration <- 30 # min - artificial rain duration
plot_surface <- 1.5  # m^2  
g.plot2t.ha <- (1/10^6) * # g to EU tons
  (10^4 / plot_surface) #*   # by ha
  # n_r_ev                  # yearly impact
g2mm <- (1000 * 60/Duration) / (plot_surface * 1000^2) # m^2_to_mm^2
# each gram = 1 cm^3, i.e. 1000 mm^3 collected in half an hour;
# to scale to one hour, multiply by 2;
# to convert to a water height (mm) divide by the plot surface, 1500000 mm^2

# Runoff measures completion
na_rpl <- Runoff |> 
  filter(Block == 3, Cover == 'W', Trial == 2, Time1 == 20) |> 
  select(Runoff_g) |> pluck(1)
Runoff <- Runoff |> replace_na(list(Runoff_g = na_rpl))

measures <-
  bind_rows(
    # Artificial rain
    RainGauge |>
      inner_join(RainGaugeReferenceArea) |>
      filter(Represented_area_mm2 > 0) |>
      group_by(Trial, Cover, Block) |>
      # summarise(Ra = weighted.mean(
      #  Rain_intensity_mmh.1, Represented_area_mm2
      summarise(Ra = mean(Rain_intensity_mmh.1)) |>
      transmute(Trial, Cover, Block, metric = "Rain_intensity_mmh.1", measure = Ra)
    ,
    
    # Runoff water
    Runoff |>
      group_by(Trial, Cover, Block) |>
      summarise(
        Ru0 = sum(Runoff_g),
        st = min(Time0),
        en = max(Time1)
      ) |>
      transmute(
        Trial,  Cover, Block,
        metric = "Runoff_g",
        measure = Ru0 * g2mm
      )
    ,
    
    # Solid matter transported, with C pct
    Sediment |>
      mutate(Sediment_t_ha = Sediment_tot_g * g.plot2t.ha) |> 
      select(-Sediment_tot_g) |> 
      pivot_longer(
        -c(Trial, Cover, Block),
        names_to = "metric",
        values_to = "measure"
      ) |>
      select(Trial, Cover, Block, metric, measure) |>
      arrange(metric, Trial, Cover, Block)
  )
Joining, by = "Position"
`summarise()` has grouped output by 'Trial', 'Cover'. You can override using
the `.groups` argument.
`summarise()` has grouped output by 'Trial', 'Cover'. You can override using
the `.groups` argument.

In a second step, the number of blocks, means and standard deviations are computed and pasted into a string for table display.

library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
measures |> 
  group_by(metric, Trial, Cover) |> 
  summarise(nblk = n(), mean = mean(measure), std = sd(measure),
            .groups = "drop") |> 
  mutate(dec = case_when(str_detect(metric, "Sediment") ~ 4, T ~ 1), 
         cell = paste(map2_chr(mean, dec, \(m, d) formatC(m, digits = d, format = "f")), 
                      "±", 
                      map2_chr(std, dec, \(s, d) formatC(s, digits = d, format = "f"))
                      )
         ) |> 
  pivot_wider(id_cols = c(Trial, Cover,  nblk), names_from = metric, values_from = cell) |>
  select(Trial, Cover,  nblk, 
         Rain_intensity_mmh.1, Runoff_g, Sediment_t_ha, Organic_matter_pct) |> 
  rename(
    "n. blocks" = nblk,
    "Rainfall [mm/h]" = Rain_intensity_mmh.1,
    "Runoff [mm/h]" = Runoff_g,
    # "Sediment [t*ha^1*year^1]" = Sediment_t_ha,
    "Sediment [t/ha]" = Sediment_t_ha,
    "Organic C [%]" = Organic_matter_pct) -> tab1

tab1 |> 
  kable(caption = paste("Average values and standard deviation of artificial rainfall (duration = ", Duration, "min) and corresponding water runoff, total solid transport (including organic matter) and organic C percentage"),
        align = "c") |>
  kable_styling(bootstrap_options = "striped")
Average values and standard deviation of artificial rainfall (duration = 30 min) and corresponding water runoff, total solid transport (including organic matter) and organic C percentage
Trial Cover n. blocks Rainfall [mm/h] Runoff [mm/h] Sediment [t/ha] Organic C [%]
1 W 8 43.8 ± 6.3 0.7 ± 0.8 0.0148 ± 0.0114 22.9 ± 8.9
1 WO 8 45.4 ± 6.2 1.0 ± 1.1 0.0097 ± 0.0083 17.5 ± 6.8
2 W 8 43.1 ± 8.4 1.3 ± 1.8 0.0112 ± 0.0061 21.2 ± 7.0
2 WO 8 42.2 ± 7.5 1.6 ± 2.1 0.0093 ± 0.0083 21.5 ± 4.8
3 WO 5 42.2 ± 5.2 1.0 ± 0.9 0.0238 ± 0.0448 20.7 ± 9.4
tab1 |> 
  kable(caption = paste("Average values and standard deviation of artificial rainfall (duration = ", Duration, "min) and corresponding water runoff, total solid transport (including organic matter) and organic C percentage"),
        align = "c", format = "latex") |>
  kable_styling(latex_options = "striped") |> 
  save_kable("tab1.tex")

Conclusion

These pages offer a simplified example of how to exploit data that have been shared after using the NFS-DataDocumentationProcedure to make the archive more FAIR.

The procedure increases data FAIRness because the relational database explicitly presents and guarantees data coherence and makes the semantic connections understandable. A standard report documenting the data gathering and the final organization offers an effective quick overview of the ‘big picture’ and an efficient way to find details when needed.