Project Conceptual Diagram

This document summarizes a collection of solutions for extracting recovery measures information from Recovery Documents using Reproducible Analytical Pipelines. Findings will help identify best practices for future data entry and reporting. Outcomes of this project will include:

For more information about this project, please see the project proposal.

workflow

Exploration of options for Step 1 to Step 2: Examples below display outcomes for Table 4 of Mud-Piddock Recovery Assessment.

Option1

This first section provides a summary of solutions if tables in PDFs are the preferred way to store information. Examples below display outcomes for Table 4 of Mud-Piddock Recovery Assessment.

Original screenshot of table in the PDF

tabulizer R Package

tabulizer provides R bindings to the Tabula java library, which can be used to computationaly extract tables from PDF documents. Package ‘tabulizer’ was removed from the CRAN repository thus installation was conducted directly from GitHub. This could leave us at the mercy of package developers that could discontinue its maintenance. More here: https://github.com/ropensci/tabulizer

#remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
library(remotes)
library(rJava)
library(tabulizer)
tabletest <- "Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.pdf"
#fullextraction <- extract_tables(tabletest) # You may uncomment this line if you would like to extract tables form the whole document.
table4 <- extract_tables(tabletest, pages = 20)
table4
## [[1]]
##       [,1]                   [,2]                   [,3]            [,4]      
##  [1,] ""                     ""                     "Likelihood of" "Level of"
##  [2,] "Threat"               "Geographic scale"     ""              ""        
##  [3,] ""                     ""                     "occurrence"    "impact"  
##  [4,] ""                     ""                     ""              ""        
##  [5,] ""                     "Minas Basin/"         ""              ""        
##  [6,] "Climate change"       ""                     "Known"         "High"    
##  [7,] ""                     "Cobequid Bay"         ""              ""        
##  [8,] ""                     ""                     ""              ""        
##  [9,] "Alteration of"        "Cobequid Bay (with"   ""              ""        
## [10,] ""                     "no consideration for" ""              ""        
## [11,] "shoreline or water"   ""                     "Likely"        "Low"     
## [12,] ""                     "potential far-field"  ""              ""        
## [13,] "control structures"   ""                     ""              ""        
## [14,] ""                     "effects)"             ""              ""        
## [15,] "Mineral exploration"  ""                     ""              ""        
## [16,] "or extraction"        ""                     ""              ""        
## [17,] ""                     "Minas Basin/"         ""              ""        
## [18,] "activities in Minas"  ""                     "Unlikely"      "Unknown" 
## [19,] ""                     "Cobequid Bay"         ""              ""        
## [20,] "Basin and nearby"     ""                     ""              ""        
## [21,] "rivers"               ""                     ""              ""        
## [22,] "Large-scale tidal"    ""                     ""              ""        
## [23,] ""                     "Minas Basin"          "Unlikely"      "Unknown" 
## [24,] "energy extraction"    ""                     ""              ""        
## [25,] "Other sources of"     ""                     ""              ""        
## [26,] "pollution, including" "Minas Basin/"         ""              ""        
## [27,] ""                     ""                     "Remote"        "Unknown" 
## [28,] "non-point source"     "Cobequid Bay"         ""              ""        
## [29,] "pollution"            ""                     ""              ""        
## [30,] "Release of"           ""                     ""              ""        
## [31,] "petroleum products"   "Minas Basin/"         ""              ""        
## [32,] ""                     ""                     "Remote"        "Extreme" 
## [33,] "in Gulf of Maine and" "Cobequid Bay"         ""              ""        
## [34,] "Bay of Fundy"         ""                     ""              ""        
##       [,5]        [,6]      [,7]           [,8]         [,9]        
##  [1,] "Causal"    "Threat"  "Threat"       "Threat"     "Threat"    
##  [2,] ""          ""        ""             ""           ""          
##  [3,] "certainty" "risk"    "occurrence"   "frequency"  "extent"    
##  [4,] ""          ""        ""             ""           ""          
##  [5,] ""          ""        ""             ""           ""          
##  [6,] "Low"       "High"    "Current"      "Continuous" "Extensive" 
##  [7,] ""          ""        ""             ""           ""          
##  [8,] ""          ""        ""             ""           ""          
##  [9,] ""          ""        ""             ""           ""          
## [10,] ""          ""        ""             ""           ""          
## [11,] "Very Low"  "Low"     "Anticipatory" "Recurrent"  "Narrow"    
## [12,] ""          ""        ""             ""           ""          
## [13,] ""          ""        ""             ""           ""          
## [14,] ""          ""        ""             ""           ""          
## [15,] ""          ""        ""             ""           ""          
## [16,] ""          ""        ""             ""           ""          
## [17,] ""          ""        ""             ""           ""          
## [18,] "Low"       "Unknown" "Anticipatory" "Continuous" "Restricted"
## [19,] ""          ""        ""             ""           ""          
## [20,] ""          ""        ""             ""           ""          
## [21,] ""          ""        ""             ""           ""          
## [22,] ""          ""        ""             ""           ""          
## [23,] "Medium"    "Unknown" "Anticipatory" "Continuous" "Broad"     
## [24,] ""          ""        ""             ""           ""          
## [25,] ""          ""        ""             ""           ""          
## [26,] ""          ""        ""             ""           ""          
## [27,] "Low"       "Unknown" "Anticipatory" "Single"     "Narrow"    
## [28,] ""          ""        ""             ""           ""          
## [29,] ""          ""        ""             ""           ""          
## [30,] ""          ""        ""             ""           ""          
## [31,] ""          ""        ""             ""           ""          
## [32,] "Medium"    "Low"     "Anticipatory" "Recurrent"  "Extensive" 
## [33,] ""          ""        ""             ""           ""          
## [34,] ""          ""        ""             ""           ""

The output is a bit convoluted with a series of empty spaces and disjointed/unconnected text. Thus next step is to turn the table extraction into a data frame to get an output that resembles the original table a tiny bit more, yet issues prevailed and a bit more tweaking is required (coding side and/or way tables are designed = the simpler the table, the easier for the code to extract it).

table4df <- extract_tables(tabletest, pages = 20, output = "data.frame") %>% as.data.frame()
table4df
##                       X                  X.1 Likelihood.of Level.of    Causal
## 1                Threat     Geographic scale                                 
## 2                                               occurrence   impact certainty
## 3                                                                            
## 4                               Minas Basin/                                 
## 5        Climate change                              Known     High       Low
## 6                               Cobequid Bay                                 
## 7                                                                            
## 8         Alteration of   Cobequid Bay (with                                 
## 9                       no consideration for                                 
## 10   shoreline or water                             Likely      Low  Very Low
## 11                       potential far-field                                 
## 12   control structures                                                      
## 13                                  effects)                                 
## 14  Mineral exploration                                                      
## 15        or extraction                                                      
## 16                              Minas Basin/                                 
## 17  activities in Minas                           Unlikely  Unknown       Low
## 18                              Cobequid Bay                                 
## 19     Basin and nearby                                                      
## 20               rivers                                                      
## 21    Large-scale tidal                                                      
## 22                               Minas Basin      Unlikely  Unknown    Medium
## 23    energy extraction                                                      
## 24     Other sources of                                                      
## 25 pollution, including         Minas Basin/                                 
## 26                                                  Remote  Unknown       Low
## 27     non-point source         Cobequid Bay                                 
## 28            pollution                                                      
## 29           Release of                                                      
## 30   petroleum products         Minas Basin/                                 
## 31                                                  Remote  Extreme    Medium
## 32 in Gulf of Maine and         Cobequid Bay                                 
## 33         Bay of Fundy                                                      
##     Threat     Threat.1   Threat.2   Threat.3
## 1                                            
## 2     risk   occurrence  frequency     extent
## 3                                            
## 4                                            
## 5     High      Current Continuous  Extensive
## 6                                            
## 7                                            
## 8                                            
## 9                                            
## 10     Low Anticipatory  Recurrent     Narrow
## 11                                           
## 12                                           
## 13                                           
## 14                                           
## 15                                           
## 16                                           
## 17 Unknown Anticipatory Continuous Restricted
## 18                                           
## 19                                           
## 20                                           
## 21                                           
## 22 Unknown Anticipatory Continuous      Broad
## 23                                           
## 24                                           
## 25                                           
## 26 Unknown Anticipatory     Single     Narrow
## 27                                           
## 28                                           
## 29                                           
## 30                                           
## 31     Low Anticipatory  Recurrent  Extensive
## 32                                           
## 33

The solution below still requires a bit more development but the simple extraction seems similar to the original table 4.

library(kableExtra)
table4df %>% kbl(caption = "Table 4") %>%
         kable_classic(full_width = F, html_font = "Cambria") 
Table 4
X X.1 Likelihood.of Level.of Causal Threat Threat.1 Threat.2 Threat.3
Threat Geographic scale
occurrence impact certainty risk occurrence frequency extent
Minas Basin/
Climate change Known High Low High Current Continuous Extensive
Cobequid Bay
Alteration of Cobequid Bay (with
no consideration for
shoreline or water Likely Low Very Low Low Anticipatory Recurrent Narrow
potential far-field
control structures
effects)
Mineral exploration
or extraction
Minas Basin/
activities in Minas Unlikely Unknown Low Unknown Anticipatory Continuous Restricted
Cobequid Bay
Basin and nearby
rivers
Large-scale tidal
Minas Basin Unlikely Unknown Medium Unknown Anticipatory Continuous Broad
energy extraction
Other sources of
pollution, including Minas Basin/
Remote Unknown Low Unknown Anticipatory Single Narrow
non-point source Cobequid Bay
pollution
Release of
petroleum products Minas Basin/
Remote Extreme Medium Low Anticipatory Recurrent Extensive
in Gulf of Maine and Cobequid Bay
Bay of Fundy

Demos:

docxtractr R Package

PDFs seem to have some tricky issues consistently observed while trying different solutions. What about Word Files? We saved Table 4 as a separate Word file and tested docxtractr R Package.

screenshot of table in Word is displayed above.

The solution below still requires a bit more development but the simple extraction seems similar to the original table 4. Keeping tables as simple as possible in the word or PDF file is one of the main pieces to implement moving forward.

library(docxtractr)
tabletestdocx <-  read_docx("table4.docx")
docx_tbl_count(tabletestdocx)
## [1] 1
tbls <- docx_extract_all_tbls(tabletestdocx)
tbls
## [[1]]
## # A tibble: 7 × 9
##   V1                             V2    V3    V4    V5    V6    V7    V8    V9   
##   <chr>                          <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Threat                         Geog… Like… Leve… Caus… Thre… Thre… Thre… Thre…
## 2 Climate change                 Mina… Known High  Low   High  Curr… Cont… Exte…
## 3 Alteration of shoreline or wa… Cobe… Like… Low   Very… Low   Anti… Recu… Narr…
## 4 Mineral exploration or extrac… Mina… Unli… Unkn… Low   Unkn… Anti… Cont… Rest…
## 5 Large-scale tidal energy extr… Mina… Unli… Unkn… Medi… Unkn… Anti… Cont… Broad
## 6 Other sources of pollution, i… Mina… Remo… Unkn… Low   Unkn… Anti… Sing… Narr…
## 7 Release of petroleum products… Mina… Remo… Extr… Medi… Low   Anti… Recu… Exte…
library(kableExtra)
tbls %>% kbl(caption = "Table 4") %>%
         kable_classic(full_width = F, html_font = "Cambria") 
Table 4
V1 V2 V3 V4 V5 V6 V7 V8 V9
Threat Geographic scale Likelihood of occurrence Level of impact Causal certainty Threat risk Threat occurrence Threat frequency Threat extent
Climate change Minas Basin/ Cobequid Bay Known High Low High Current Continuous Extensive
Alteration of shoreline or water control structures Cobequid Bay (with no consideration for potential far-field effects) Likely Low Very Low Low Anticipatory Recurrent Narrow
Mineral exploration or extraction activities in Minas Basin and nearby rivers Minas Basin/ Cobequid Bay Unlikely Unknown Low Unknown Anticipatory Continuous Restricted
Large-scale tidal energy extraction Minas Basin Unlikely Unknown Medium Unknown Anticipatory Continuous Broad
Other sources of pollution, including non-point source pollution Minas Basin/ Cobequid Bay Remote Unknown Low Unknown Anticipatory Single Narrow
Release of petroleum products in Gulf of Maine and Bay of Fundy Minas Basin/ Cobequid Bay Remote Extreme Medium Low Anticipatory Recurrent Extensive

Camelot Python Package

Camelot is a python package, installable from Pip that can extract tables from a PDF based on the lines between cells (Lattice). This helps improve the quality of the parsed data.

from camelot import read_pdf

doc_file_path = "Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.pdf"
df_list = read_pdf(doc_file_path, pages="20", line_scale=50)
# this is a list, because there might be more than one table on the page
# print everything:
print(df_list[0].df.to_string())
##                                                                                        0                                                                           1                           2                  3                   4              5                    6                   7                8
## 0                                                                                 Threat                                                            Geographic scale  Likelihood of \noccurrence  Level of \nimpact  Causal \ncertainty  Threat \nrisk  Threat \noccurrence  Threat \nfrequency  Threat \nextent
## 1                                                                         Climate change                                                 Minas Basin/ \nCobequid Bay                       Known               High                 Low           High              Current          Continuous        Extensive
## 2                                Alteration of \nshoreline or water \ncontrol structures  Cobequid Bay (with \nno consideration for \npotential far-field \neffects)                      Likely                Low            Very Low            Low         Anticipatory           Recurrent           Narrow
## 3  Mineral exploration \nor extraction \nactivities in Minas \nBasin and nearby \nrivers                                                 Minas Basin/ \nCobequid Bay                    Unlikely            Unknown                 Low        Unknown         Anticipatory          Continuous       Restricted
## 4                                                  Large-scale tidal \nenergy extraction                                                                 Minas Basin                    Unlikely            Unknown              Medium        Unknown         Anticipatory          Continuous            Broad
## 5                 Other sources of \npollution, including \nnon-point source \npollution                                                 Minas Basin/ \nCobequid Bay                      Remote            Unknown                 Low        Unknown         Anticipatory              Single           Narrow
## 6                  Release of \npetroleum products \nin Gulf of Maine and \nBay of Fundy                                                 Minas Basin/ \nCobequid Bay                      Remote            Extreme              Medium            Low         Anticipatory           Recurrent        Extensive

There are a couple potential issues with this method. Special characters are explicitly recorded (newlines are saved as ) and items such as graphs can be mistaken as tables.

Docx Python Package

Docx is a python package, installable from Pip (pip install python-docx) that can extract tables from word documents.

from docx import Document
recovery_docx = Document("Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.docx")
# recovery_docx.tables is a list containing all of the tables in the docx, print the 3rd one:
for row in recovery_docx.tables[3].columns:
    for cell in row.cells:
        print(cell.text)
## Habitat type
## Tidal pools
## Boulders
## 
## Capstone
## Rivulets
## 
## Patches
## Habitat use
## In the bottom of tidal pools with water >0.5 cm deep and an area greater than a patch (>1.0 m2)
## Around the base of large cobbles or boulders where tidal scour causes sediment flushing
## Adjacent to or under resistant capstone that protects underlying mudstone against scour from ice and other materials
## In rivulets or channels where freshwater runoff or the tide causes sediment flushing
## In patches in the intertidal zone covered by water <0.5 cm deep and an area smaller than a tidal pool (<1.0 m2)

This method produces a very reliable output that is fairly easy to work with.

Tabula

  • Tabula is free and open source. Tabula was created by journalists for journalists and anyone else working with data locked away in PDFs. https://tabula.technology/
    Software can be used to extract data manually, or folks can implement a solution programatically using python.

Demos:

Azure Form Recognizer

Azure Form Recognizer is a cloud-based Azure Applied AI Service that analyzes forms and documents, extracts text and data, and maps field relationships as key-value pairs: https://formrecognizer.appliedai.azure.com/studio

Azure Form Recognizer could potentially allow coordination with tools withing our Microsoft Suite. It is powered by Python.

Azure Form Recognizer does not exist in our current suite of tools. We logged in a ticket with IT and they confirmed that in order to complete our request we would need you fill out a form so that software can be evaluated by IT security before installing to make sure it is safe to have it on our network. It does not seem to be an open source solution (0-500 pages are free per month, after that, there is a cost associated with using this tool https://azure.microsoft.com/en-ca/pricing/details/form-recognizer/)

We have not tested this option yet as there seems to be a price tag associated with number of pages depending on the capacity of data mining.

Demos:

Implementing solutions for Step 1 to Step 2: Examples below display outcomes for two Word Documents slighly modified

DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]
  
  node [shape = rectangle]        
  rec1 [label = 'PDF Recovery Documents']
  rec2 [label = 'Create Word Recovery Document using Foxit']
  rec3 [label = 'Identify key terms of tables to be extracted in the Word document']
  rec4 [label = 'Test extraction of relevant tables']

  rec1 -> rec2 -> rec3 -> rec4 
  }",
  height = 500)

Word files seem to be providing a really nice output. Our next task was to explore which tables specifically will be mined. For this, we manually compared the spreadsheet information shared by the SARA Program and the source document/word files for Atlantic Mud-piddock and Blue whale.

Column Name in Word Doc Table Column Name in Spreadsheet
Species Species Name
Threat Not included
Geographic scale Not included
Likelihood of occurrence Not included
Level of impact Not included
Causal certainty Not included
Threat risk Not included
Threat occurrence Not included
Threat frequency Not included
Threat extent Not included
General description of research and management approaches Recovery Measures/ Conservation Measures
Priority Not included
Broad strategy Not included
Threat or concern addressed Not included
Column Name in Spreadsheet Column Name in Word Doc Table
Designatable Unit (DU) Not included
Taxon Not included
COSEWIC Status Not included
SARA Status Not included
Lead Region Not included
Broad Strategy Not included
Document Used Not included
Document Reference Not included
Column name in Spreadsheet Tab Recovery Measure Column name in Spreadsheet Tab Recovery Measure Tracking
Recovery Measures/ Conservation Measures AP: Recovery Measures

We tested the process of adding a potential new table in the Recovery Documents/Action Plans that could be added in future word files with information that could be mined into an authoritative spreadsheet.

Proposed new table

We also modified and standardized names of the additional columns to be mined from the word documents into the spreadsheets. These are tentative names until SARA Program decides which names could be used systematically in all future documents. This will make the process of mining tables significantly simpler as the codes will be looking for this unique names within tables. For this test, we did the following changes in the two word documents:

The above mentioned changes have led to the following key words to mine the two test word documents:

Note: solutions will not be case sensitive.

Mining tables from Word files proved to be the most effective solution using the least amount of coding and manipulation. We briefly tested one of the documents with the new proposed tables.

library(docxtractr)
testdocx <-  read_docx("Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.docx")
docx_tbl_count(testdocx)
## [1] 25
alltables <- docx_extract_all_tbls(testdocx)
str(alltables)
## List of 25
##  $ : tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
##   ..$ Species.Name     : chr "Atlantic Mud-piddock"
##   ..$ Designatable.Unit: chr "NA"
##   ..$ Taxon            : chr "Marine Mollusc"
##   ..$ COSEWIC.Status   : chr "TH"
##   ..$ SARA.Status      : chr "TH"
##   ..$ Lead.Region      : chr "Maritimes"
##   ..$ Document.Used    : chr "Recovery Strategy"
##  $ : tibble [2 × 7] (S3: tbl_df/tbl/data.frame)
##   ..$ Kingdom: chr [1:2] "Animalia" "Animals"
##   ..$ Phylum : chr [1:2] "Mollusca" "Clams, snails, octopuses, squid, scallops, oysters, andchitons"
##   ..$ Class  : chr [1:2] "Bivalva" "Marine and freshwater molluscs with a two- part hingedshell"
##   ..$ Order  : chr [1:2] "Myida" "Burrowing salt water clams"
##   ..$ Family : chr [1:2] "Pholadidae" "Piddocks"
##   ..$ Genus  : chr [1:2] "Barnea" "Distinguished from other Pholadidae by morphological characteristics"
##   ..$ Species: chr [1:2] "Barnea truncata" "Atlantic Mud- piddock"
##  $ : tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ X             : chr [1:2] "Core sites" "Peripheral sites"
##   ..$ Stable.sites  : chr [1:2] "Tennycape, Burntcoat Head, Noel Bay, Sloop Rocks, Shad Creek, and MungoBrook" "Spencer Point and Port Williams"
##   ..$ Unstable.sites: chr [1:2] "N/A" "Parrsboro, Kingsport, Evangeline Beach, Economy Point, and Five Islands"
##  $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Habitat.type: chr [1:5] "Tidal pools" "Boulders" "Capstone" "Rivulets" ...
##   ..$ Habitat.use : chr [1:5] "In the bottom of tidal pools with water >0.5 cm deep and an area greater than a patch (>1.0 m2)" "Around the base of large cobbles or boulders where tidal scour causes sediment flushing" "Adjacent to or under resistant capstone that protects underlying mudstone against scour from ice and other materials" "In rivulets or channels where freshwater runoff or the tide causes sediment flushing" ...
##  $ : tibble [6 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat                  : chr [1:6] "Climate change" "Alteration of shoreline or water control structures" "Mineral exploration or extraction activities in Minas Basin and nearby rivers" "Large-scale tidal energy extraction" ...
##   ..$ Geographic.scale        : chr [1:6] "Minas Basin/ Cobequid Bay" "Cobequid Bay (with no consideration for potential far-field effects)" "Minas Basin/ Cobequid Bay" "Minas Basin" ...
##   ..$ Likelihood.of.occurrence: chr [1:6] "Known" "Likely" "Unlikely" "Unlikely" ...
##   ..$ Level.of.impact         : chr [1:6] "High" "Low" "Unknown" "Unknown" ...
##   ..$ Causal.certainty        : chr [1:6] "Low" "Very Low" "Low" "Medium" ...
##   ..$ Threat.risk             : chr [1:6] "High" "Low" "Unknown" "Unknown" ...
##   ..$ Threat.occurrence       : chr [1:6] "Current" "Anticipatory" "Anticipatory" "Anticipatory" ...
##   ..$ Threat.frequency        : chr [1:6] "Continuous" "Recurrent" "Continuous" "Continuous" ...
##   ..$ Threat.extent           : chr [1:6] "Extensive" "Narrow" "Restricted" "Broad" ...
##  $ : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat                  : chr "Recreation and adventure sport activities"
##   ..$ Geographic.scale        : chr "Minas Basin/ Cobequid Bay"
##   ..$ Likelihood.of.occurrence: chr "Known"
##   ..$ Level.of.impact         : chr "Low"
##   ..$ Causal.certainty        : chr "Medium"
##   ..$ Threat.risk             : chr "Low"
##   ..$ Threat.occurrence       : chr "Current"
##   ..$ Threat.frequency        : chr "Recurrent"
##   ..$ Threat.extent           : chr "Restricted"
##  $ : tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
##   ..$ X.                         : chr [1:10] "1" "2" "3" "4" ...
##   ..$ Recovery.measures          : chr [1:10] "1. Monitor, and refine knowledge of, threats to the species" "2. Research and monitor sedimentation regimes in the Minas Basin and at Atlantic Mud-piddock habitat sites, inc"| __truncated__ "3. Establish and implement a monitoring program to survey distribution of available habitat and live individual"| __truncated__ "4. Increase understanding of life history characteristics (for example, age structure, recruitment, mortality)" ...
##   ..$ Priority5                  : chr [1:10] "Medium" "Medium" "Medium" "Low" ...
##   ..$ Broad.strategy             : chr [1:10] "Research and monitoring" "Research and monitoring" "Research and monitoring" "Research and monitoring" ...
##   ..$ Threat.or.concern.addressed: chr [1:10] "All threats" "Sedimentation, climate change" "Habitat availability, population abundance and distribution" "Knowledge gaps" ...
##  $ : tibble [4 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ General.description.of.research.and.management.approaches: chr [1:4] "" "" "" ""
##   ..$ Priority5                                                : chr [1:4] "" "" "" ""
##   ..$ Broad.strategy                                           : chr [1:4] "" "" "" ""
##   ..$ Threat.or.concern.addressed                              : chr [1:4] "" "" "" ""
##  $ : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ Life.stage   : chr [1:3] "Egg" "Trochophore larva" "Veliger larva"
##   ..$ Function6    : chr [1:3] "Egg development and growth" "Growth" "Settlement"
##   ..$ Feature.s.7  : chr [1:3] "Water column in intertidal and subtidal zones" "Water column in intertidal and subtidal zones" "Water column in intertidal and subtidal zones"
##   ..$ Attribute.s.8: chr [1:3] "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone9: 20.0 to29.5 pptTemperature and oxygen level"| __truncated__ "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone: 20.0 to29.5 pptTemperature and oxygen levels"| __truncated__ "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone: 20.0 to29.5 pptTemperature and oxygen levels"| __truncated__
##  $ : tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ Life.stage   : chr [1:5] "" "Veliger larva" "Sub-adult" "Sub-adult" ...
##   ..$ Function6    : chr [1:5] "" "Settlement" "Boring and growth" "Boring and growth" ...
##   ..$ Feature.s.7  : chr [1:5] "" "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Water column in the intertidal zone" ...
##   ..$ Attribute.s.8: chr [1:5] "High slack tide for settlement" "Available red mudstone surfaces to settle upon" "Available red mudstone of sufficient depth (~5 cm) within which to bore and matureNo rapid or significant accum"| __truncated__ "Salinity: 5 to 25 pptTemperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Su"| __truncated__ ...
##  $ : tibble [4 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ Life.stage   : chr [1:4] "" "Adult" "Adult" "All (except egg)"
##   ..$ Function6    : chr [1:4] "" "Reproduction" "Reproduction" "Feeding"
##   ..$ Feature.s.7  : chr [1:4] "Water column in the intertidal zone" "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Water column in the intertidal zone" "Food supply"
##   ..$ Attribute.s.8: chr [1:4] "Salinity: 5 to 25 pptTemperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)No"| __truncated__ "Available mudstone from which adults can release eggs and spermNo rapid or significant accumulation of sediments" "Salinity: 5 to 25 pptOxygen levels within natural range of variation (~5 mg/L)Suitable temperature for external"| __truncated__ "Sufficient quality and quantity of food (plankton and particulates)"
##  $ : tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ Description.of.study: chr [1:4] "Habitat mapping via foot surveys to determine the amount of available and occupied habitat at inhabited and extirpated sites." "Habitat mapping using unmanned aerial vehicle (UAV) and high accuracy survey-grade GPS technology to identify p"| __truncated__ "Conduct research on spawning to determine timing, concurrent water temperature, and other environmental determinants." "Conduct research and modelling to refine understanding of temporal, spatial, and environmental features and att"| __truncated__
##   ..$ Rationale           : chr [1:4] "This will allow for comparisons of the extent of available and occupied habitat at known Atlantic Mud- piddock "| __truncated__ "This work will supplement foot surveys to inform characterization of habitat availability as either stable, inc"| __truncated__ "The habitat attributes required for spawning are unknown. This study is expected to improve understanding of ha"| __truncated__ "Little is known about the habitat requirements and movement of larvae in the Minas Basin. Modelling larval dist"| __truncated__
##   ..$ Timeline10          : chr [1:4] "5 years and ongoing" "5 years and ongoing" "5 years" "5 years"
##  $ : tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat               : chr "Climate change"
##   ..$ Activity             : chr "Global greenhouse gas emissions from human activities"
##   ..$ Effect..pathway      : chr "Increase in the frequency, length, and severity of storms and rainfall, causing disruption to sediment depositi"| __truncated__
##   ..$ Function.s..affected : chr "Egg development and growthTrochophore larva growthVeliger larva settlementSub-adult boring and growthAdult grow"| __truncated__
##   ..$ Feature.s..affected  : chr "Water in intertidal and subtidal zonesRed mudstone in the intertidal zone where water is present at high and lo"| __truncated__
##   ..$ Attribute.s..affected: chr "Temperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Suitable temperature fo"| __truncated__
##  $ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat               : chr [1:2] "Alteration of shoreline or water control structures" "Mineral exploration or extraction activities in Minas Basin and nearby rivers"
##   ..$ Activity             : chr [1:2] "DredgingConstruction of coastal infrastructure (for example, marina, wharf)Installation of shoreline- modifying"| __truncated__ "Dredging"
##   ..$ Effect..pathway      : chr [1:2] "Current alteration, causing reduced red mudstone availability or smothering through siltation events" "Disruption of sediments, causing reduced red mudstone availability or smothering through sedimentation events"
##   ..$ Function.s..affected : chr [1:2] "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproduction" "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproduction"
##   ..$ Feature.s..affected  : chr [1:2] "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__ "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__
##   ..$ Attribute.s..affected: chr [1:2] "Available red mudstone surfaces upon which to settle, and within which to bore, mature, and reproduceNo rapid o"| __truncated__ "Available red mudstone surfaces upon which to settle, and within which to bore, mature, and reproduceNo rapid o"| __truncated__
##  $ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat               : chr [1:2] "Large-scale tidal energy extraction" "Sources of pollution, including point and non-point source pollution"
##   ..$ Activity             : chr [1:2] "Deployment and operation of tidal power turbines within the Minas Basin for energy extraction" "Agricultural activities leading to run- off (for example, pesticides, animal waste)Urban activities leading to "| __truncated__
##   ..$ Effect..pathway      : chr [1:2] "Alteration of tidal, and thus sedimentation regimes within the Minas Basin, reducing red mudstone availability "| __truncated__ "Introduction of pollutants, resulting in reduction in water quality and food availability"
##   ..$ Function.s..affected : chr [1:2] "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproductionLarval, sub-adult, and adult feeding" "Egg development and growthTrochophore larva growthVeliger larva settlementSub-adult boring and growthAdult grow"| __truncated__
##   ..$ Feature.s..affected  : chr [1:2] "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__ "Water in intertidal and subtidal zones"
##   ..$ Attribute.s..affected: chr [1:2] "No rapid or significant accumulation of sedimentsAvailable red mudstone surfaces upon which to settle, and with"| __truncated__ "Oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Sufficient water qualitySufficient qual"| __truncated__
##  $ : tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat               : chr "Recreation and adventure sport activities"
##   ..$ Activity             : chr "Running eventsBiking eventsAll terrain- vehicle use"
##   ..$ Effect..pathway      : chr "Frequent or intense site use resulting in degradation of substrate and reduced red mudstone availability"
##   ..$ Function.s..affected : chr "Veliger larva settlementSub-adult growth and boringAdult growthAdult reproduction"
##   ..$ Feature.s..affected  : chr "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__
##   ..$ Attribute.s..affected: chr "Available red mudstone surfaces upon which to settle, bore, mature, and reproduceRed mudstone of sufficient dep"| __truncated__
##  $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Likelihood.of.occurrence: chr [1:5] "Known or very likely to occur" "Likely to occur" "Unlikely" "Remote" ...
##   ..$ Definition              : chr [1:5] "This threat has been recorded to occur 91 to 100%" "There is 51 to 90% chance that this threat is or will be occurring." "There is 11 to 50% chance that this threat is or will be occurring" "There is 1 to 10% or less chance that this threat is or will be occurring." ...
##  $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Level.of.impact: chr [1:5] "Extreme" "High" "Medium" "Low" ...
##   ..$ Definition     : chr [1:5] "Severe population decline (for example, 71 to 100%) with the potential for extirpation." "Substantial loss of population (31 to 70%) orThreat would jeopardize the survival or recovery of the population." "Moderate loss of population (11 to 30%) orThreat is likely to jeopardize the survival or recovery of the population." "Little change in population (1 to 10%) orThreat is unlikely to jeopardize the survival or recovery of the population." ...
##  $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Causal.certainty: chr [1:5] "Very high" "High" "Medium" "Low" ...
##   ..$ Definition      : chr [1:5] "Very strong evidence that threat is occurring and the magnitude of the impact to the population can be quantified." "Substantial evidence of a causal link between threat and population decline or jeopardy to survival or recovery" "There is some evidence linking the threat to population decline or jeopardy to survival or recovery" "There is a theoretical link with limited evidence that threat is leading to a population decline or jeopardy to"| __truncated__ ...
##  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat.occurrence: chr [1:3] "Historical" "Current" "Anticipatory"
##   ..$ Definition       : chr [1:3] "A threat that is known to have occurred in the past and negatively impacted the population." "A threat that is ongoing, and is currently negatively impacting the population." "A threat that is anticipated to occur in the future, and will negatively impact the population."
##  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat.frequency: chr [1:3] "Single" "Recurrent" "Continuous"
##   ..$ Definition      : chr [1:3] "The threat occurs once." "The threat occurs periodically, or repeatedly." "The threat occurs without interruption."
##  $ : tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Threat.extent: chr [1:4] "Extensive" "Broad" "Narrow" "Restricted"
##   ..$ Definition   : chr [1:4] "71 to 100% of the population is affected by the threat." "31 to 70% of the population is affected by the threat." "11 to 30% of the population is affected by the threat." "1 to 10% of the population is affected by the threat."
##  $ : tibble [35 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ Water.feature.name: chr [1:35] "Avon River" "Avon River" "Bass Creek (Cambridge)" "Bass Creek (Cambridge)" ...
##   ..$ Latitude          : chr [1:35] "45° 6' 31.324\" N" "45° 7' 7.156\" N" "45° 11' 57.748\" N" "45° 11' 57.000\" N" ...
##   ..$ Longitude         : chr [1:35] "64° 13' 26.971\" W" "64° 11' 55.178\" W" "64° 8' 3.999\" W" "64° 8' 2.953\" W" ...
##  $ : tibble [41 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ Water.feature.name: chr [1:41] "Gr Great Village River eat Village River" "Habitant Creek" "Habitant Creek" "Harrington River" ...
##   ..$ Latitude          : chr [1:41] "45° 23' 40.765\" N" "45° 9' 3.393\" N" "45° 8' 41.817\" N" "45° 24' 38.152\" N" ...
##   ..$ Longitude         : chr [1:41] "63° 36' 37.443\" W" "64° 22' 49.369\" W" "64° 22' 47.492\" W" "64° 5' 56.519\" W" ...
##  $ : tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ Water.feature.name: chr [1:24] "Portapique River 1" "Portapique River 1" "Portapique River 2" "Portapique River 2" ...
##   ..$ Latitude          : chr [1:24] "45° 23' 35.228\" N" "45° 23' 38.577\" N" "45° 23' 37.432\" N" "45° 23' 39.398\" N" ...
##   ..$ Longitude         : chr [1:24] "63° 42' 25.630\" W" "63° 42' 27.379\" W" "63° 42' 53.690\" W" "63° 42' 52.239\" W" ...
library(kableExtra)
alltables[[1]] %>% kbl(caption = "Testing Full Extraction of Tables") %>%
         kable_classic(full_width = F, html_font = "Cambria") 
Testing Full Extraction of Tables
Species.Name Designatable.Unit Taxon COSEWIC.Status SARA.Status Lead.Region Document.Used
Atlantic Mud-piddock NA Marine Mollusc TH TH Maritimes Recovery Strategy

User Stories

DFO staff is tasked with mining tables from Recovery Documents PDF/Word files to organized them in a nice and clean spreadsheet for future reporting. Below are potential future scenarios of what their experience could be using different approaches.

Scenario 1: Staff mines tables directly from current Recovery Document PDF/Word files

Staff mine tables that have weird double lines, non matching lines and background white lines separating the cells. Significant amount of manual intervention, extra coding and tweaking is required for each table in a customized fashion. This scenario is a bit cumbersome and requires extra manipulation.

Scenario 2: Staff mines tables directly from improved Recovery Document Word files that include additional metadata (a.k.a. tables).

Using a Reproducible Analytical Pipeline, staff mine tables directly from improved Word files provided by the authors, or generated from the authoritative PDFs using software like Foxit. Improved Word documents follow explicit guidance to create metadata (a.k.a. tables) that are mined at the same time as the filepath when scraping the document. Staff mine tables efficiently and easily because authors rigorously applied the following recommendations for each table:

  • Tables rows do not break from page to page
  • Columns names are unique, short, succinct, and fully standardized in all recovery documents
  • Unique column names are KEY because code will be looking for those in the tables extracted
  • Tables do not include numbers/codes in the values of tables. Instead, tables include standardized text to make the mining process as simple as possible e.g. instead of using “1” for coding Broad strategies, staff use the direct wording “Research and monitoring”.

Scenario 3: Staff takes advantage of a new workflow to store and mine the data using a GUI

Mining PDF and Word files is a possibility. Is there a better approach for mining data and storing it in an authoritative place? Data could be mined from something different than a PDF/Word file. Something such as database, excel sheets, GUI, which could be populated as part of a submission. In this case, standardized tables are not a deal breaker for mining information.

The evaluation of any scenario must take into account the time and user experience of i) folks writing/publishing SARA documents, ii) folks mining these documents, iii) Staff needed to maintain new tools.

Recommendations for input spreadsheet

Need a bit of cleaning before interactive report pilot - happy to do cleaning with recommendations on how to keep spreadsheets as clean as possible for coding (Excel spreadsheets, CSV, Databases)

Flags & Questions TBD

  • Are PDFs or Word Files the best way to store metadata (a.k.a) tables from Recovery Documents?
    • Word seems to work great particularly if very specific conditions are applied for every future publication. This could increase time folks spend formatting.
  • When mining tables in the future from the Recovery Document PDF/Word file:
    • Do you envision mining tables directly from the Recovery Document PDF/Word file directly? We could slect one of the options presented in this document to explore further.
    • Or are you open to recommendations in the workflow to store and mine the data from somewhere else (e.g. database, excel sheets populated as part of submission)?

Step 2 to Step 3: Elements required to generate reproducible reports

Conceptual workflow of elements that would be required to reverse-engineer outcomes 1 and 2 (above) by using forms, csv files, or Excel spreadsheets to generate reproducible reports (e.g. using Microsoft PowerBI, R Markdown, or other tools that are easily used and accessible to Species at Risk Program staff).

Option2

Proposed Milestones:

  1. To develop an interactive Power BI tool to play with outputs of the spreadsheets. This interactive outputs can inform reporting, and can support SARA staff explore questions on the fly, using the most up-to-date database or datafame.

  2. To explore the feasibility of creating a Word skeleton of a SARA document that will pre-populate as much information as possible in a reproducible manner. To tackle this, we will explore options were users may not need to interact with code directly, but can take advantage of a powerful pipeline.

While there is a steep learning curve to building this pipeline, documents will be much easier to update following the addition of new data as all tables and figures should automatically update. This workflow will also improve the transparency and reproducibility of the work, especially if changes are tracked using git.

To test this idea, we will aim to create a package called SARAdown, which will generate a skeleton word file with information pre-poulated for single-species or multi-species reports. We will test this pilot using the template under development by the SARA program: https://086gc.sharepoint.com/:w:/r/sites/MaritimesSpatialPlanning-MAROpenDataLogistics/_layouts/15/Doc.aspx?action=edit&sourcedoc=%7B5a7c2d52-5121-454f-84d1-947f37bf0658%7D&wdOrigin=TEAMS-ELECTRON.teamsSdk.openFilePreview&wdExp=TEAMS-CONTROL

This project will draw ideas and code from the following packages: