This document summarizes a collection of solutions for extracting recovery measures information from Recovery Documents using Reproducible Analytical Pipelines. Findings will help identify best practices for future data entry and reporting. Outcomes of this project will include:
Testing and evaluating the use of Reproducible Analytical Pipelines to automate data extraction from SARA recovery documents and Species at Risk geodatabase.
Make recommendations regarding document elements (e.g. formatting or codes) required to identify and extract relevant information into a spreadsheet. This will coincide with the Species at Risk Program recovery and implementation team plans to revise Species’ Progress Report templates and other recovery document templates this fiscal year.
Conceptual workflow of elements that would be required to reverse-engineer outcomes 1 and 2 (above) by using forms, csv files, or Excel spreadsheets to generate reproducible reports (e.g. using Microsoft PowerBI, R Markdown, or other tools that are easily used and accessible to Species at Risk Program staff).
For more information about this project, please see the project proposal.
workflow
Option1
This first section provides a summary of solutions if tables in PDFs are the preferred way to store information. Examples below display outcomes for Table 4 of Mud-Piddock Recovery Assessment.
Original screenshot of table in the PDF
tabulizer R Packagetabulizer provides R bindings to the Tabula java library, which can be used to computationaly extract tables from PDF documents. Package ‘tabulizer’ was removed from the CRAN repository thus installation was conducted directly from GitHub. This could leave us at the mercy of package developers that could discontinue its maintenance. More here: https://github.com/ropensci/tabulizer
#remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
library(remotes)
library(rJava)
library(tabulizer)
tabletest <- "Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.pdf"
#fullextraction <- extract_tables(tabletest) # You may uncomment this line if you would like to extract tables form the whole document.
table4 <- extract_tables(tabletest, pages = 20)
table4
## [[1]]
## [,1] [,2] [,3] [,4]
## [1,] "" "" "Likelihood of" "Level of"
## [2,] "Threat" "Geographic scale" "" ""
## [3,] "" "" "occurrence" "impact"
## [4,] "" "" "" ""
## [5,] "" "Minas Basin/" "" ""
## [6,] "Climate change" "" "Known" "High"
## [7,] "" "Cobequid Bay" "" ""
## [8,] "" "" "" ""
## [9,] "Alteration of" "Cobequid Bay (with" "" ""
## [10,] "" "no consideration for" "" ""
## [11,] "shoreline or water" "" "Likely" "Low"
## [12,] "" "potential far-field" "" ""
## [13,] "control structures" "" "" ""
## [14,] "" "effects)" "" ""
## [15,] "Mineral exploration" "" "" ""
## [16,] "or extraction" "" "" ""
## [17,] "" "Minas Basin/" "" ""
## [18,] "activities in Minas" "" "Unlikely" "Unknown"
## [19,] "" "Cobequid Bay" "" ""
## [20,] "Basin and nearby" "" "" ""
## [21,] "rivers" "" "" ""
## [22,] "Large-scale tidal" "" "" ""
## [23,] "" "Minas Basin" "Unlikely" "Unknown"
## [24,] "energy extraction" "" "" ""
## [25,] "Other sources of" "" "" ""
## [26,] "pollution, including" "Minas Basin/" "" ""
## [27,] "" "" "Remote" "Unknown"
## [28,] "non-point source" "Cobequid Bay" "" ""
## [29,] "pollution" "" "" ""
## [30,] "Release of" "" "" ""
## [31,] "petroleum products" "Minas Basin/" "" ""
## [32,] "" "" "Remote" "Extreme"
## [33,] "in Gulf of Maine and" "Cobequid Bay" "" ""
## [34,] "Bay of Fundy" "" "" ""
## [,5] [,6] [,7] [,8] [,9]
## [1,] "Causal" "Threat" "Threat" "Threat" "Threat"
## [2,] "" "" "" "" ""
## [3,] "certainty" "risk" "occurrence" "frequency" "extent"
## [4,] "" "" "" "" ""
## [5,] "" "" "" "" ""
## [6,] "Low" "High" "Current" "Continuous" "Extensive"
## [7,] "" "" "" "" ""
## [8,] "" "" "" "" ""
## [9,] "" "" "" "" ""
## [10,] "" "" "" "" ""
## [11,] "Very Low" "Low" "Anticipatory" "Recurrent" "Narrow"
## [12,] "" "" "" "" ""
## [13,] "" "" "" "" ""
## [14,] "" "" "" "" ""
## [15,] "" "" "" "" ""
## [16,] "" "" "" "" ""
## [17,] "" "" "" "" ""
## [18,] "Low" "Unknown" "Anticipatory" "Continuous" "Restricted"
## [19,] "" "" "" "" ""
## [20,] "" "" "" "" ""
## [21,] "" "" "" "" ""
## [22,] "" "" "" "" ""
## [23,] "Medium" "Unknown" "Anticipatory" "Continuous" "Broad"
## [24,] "" "" "" "" ""
## [25,] "" "" "" "" ""
## [26,] "" "" "" "" ""
## [27,] "Low" "Unknown" "Anticipatory" "Single" "Narrow"
## [28,] "" "" "" "" ""
## [29,] "" "" "" "" ""
## [30,] "" "" "" "" ""
## [31,] "" "" "" "" ""
## [32,] "Medium" "Low" "Anticipatory" "Recurrent" "Extensive"
## [33,] "" "" "" "" ""
## [34,] "" "" "" "" ""
The output is a bit convoluted with a series of empty spaces and disjointed/unconnected text. Thus next step is to turn the table extraction into a data frame to get an output that resembles the original table a tiny bit more, yet issues prevailed and a bit more tweaking is required (coding side and/or way tables are designed = the simpler the table, the easier for the code to extract it).
table4df <- extract_tables(tabletest, pages = 20, output = "data.frame") %>% as.data.frame()
table4df
## X X.1 Likelihood.of Level.of Causal
## 1 Threat Geographic scale
## 2 occurrence impact certainty
## 3
## 4 Minas Basin/
## 5 Climate change Known High Low
## 6 Cobequid Bay
## 7
## 8 Alteration of Cobequid Bay (with
## 9 no consideration for
## 10 shoreline or water Likely Low Very Low
## 11 potential far-field
## 12 control structures
## 13 effects)
## 14 Mineral exploration
## 15 or extraction
## 16 Minas Basin/
## 17 activities in Minas Unlikely Unknown Low
## 18 Cobequid Bay
## 19 Basin and nearby
## 20 rivers
## 21 Large-scale tidal
## 22 Minas Basin Unlikely Unknown Medium
## 23 energy extraction
## 24 Other sources of
## 25 pollution, including Minas Basin/
## 26 Remote Unknown Low
## 27 non-point source Cobequid Bay
## 28 pollution
## 29 Release of
## 30 petroleum products Minas Basin/
## 31 Remote Extreme Medium
## 32 in Gulf of Maine and Cobequid Bay
## 33 Bay of Fundy
## Threat Threat.1 Threat.2 Threat.3
## 1
## 2 risk occurrence frequency extent
## 3
## 4
## 5 High Current Continuous Extensive
## 6
## 7
## 8
## 9
## 10 Low Anticipatory Recurrent Narrow
## 11
## 12
## 13
## 14
## 15
## 16
## 17 Unknown Anticipatory Continuous Restricted
## 18
## 19
## 20
## 21
## 22 Unknown Anticipatory Continuous Broad
## 23
## 24
## 25
## 26 Unknown Anticipatory Single Narrow
## 27
## 28
## 29
## 30
## 31 Low Anticipatory Recurrent Extensive
## 32
## 33
The solution below still requires a bit more development but the simple extraction seems similar to the original table 4.
library(kableExtra)
table4df %>% kbl(caption = "Table 4") %>%
kable_classic(full_width = F, html_font = "Cambria")
| X | X.1 | Likelihood.of | Level.of | Causal | Threat | Threat.1 | Threat.2 | Threat.3 |
|---|---|---|---|---|---|---|---|---|
| Threat | Geographic scale | |||||||
| occurrence | impact | certainty | risk | occurrence | frequency | extent | ||
| Minas Basin/ | ||||||||
| Climate change | Known | High | Low | High | Current | Continuous | Extensive | |
| Cobequid Bay | ||||||||
| Alteration of | Cobequid Bay (with | |||||||
| no consideration for | ||||||||
| shoreline or water | Likely | Low | Very Low | Low | Anticipatory | Recurrent | Narrow | |
| potential far-field | ||||||||
| control structures | ||||||||
| effects) | ||||||||
| Mineral exploration | ||||||||
| or extraction | ||||||||
| Minas Basin/ | ||||||||
| activities in Minas | Unlikely | Unknown | Low | Unknown | Anticipatory | Continuous | Restricted | |
| Cobequid Bay | ||||||||
| Basin and nearby | ||||||||
| rivers | ||||||||
| Large-scale tidal | ||||||||
| Minas Basin | Unlikely | Unknown | Medium | Unknown | Anticipatory | Continuous | Broad | |
| energy extraction | ||||||||
| Other sources of | ||||||||
| pollution, including | Minas Basin/ | |||||||
| Remote | Unknown | Low | Unknown | Anticipatory | Single | Narrow | ||
| non-point source | Cobequid Bay | |||||||
| pollution | ||||||||
| Release of | ||||||||
| petroleum products | Minas Basin/ | |||||||
| Remote | Extreme | Medium | Low | Anticipatory | Recurrent | Extensive | ||
| in Gulf of Maine and | Cobequid Bay | |||||||
| Bay of Fundy |
Demos:
docxtractr R PackagePDFs seem to have some tricky issues consistently observed while
trying different solutions. What about Word Files? We saved Table 4 as a
separate Word file and tested docxtractr R Package.
screenshot of table in Word is displayed above.
The solution below still requires a bit more development but the simple extraction seems similar to the original table 4. Keeping tables as simple as possible in the word or PDF file is one of the main pieces to implement moving forward.
library(docxtractr)
tabletestdocx <- read_docx("table4.docx")
docx_tbl_count(tabletestdocx)
## [1] 1
tbls <- docx_extract_all_tbls(tabletestdocx)
tbls
## [[1]]
## # A tibble: 7 × 9
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Threat Geog… Like… Leve… Caus… Thre… Thre… Thre… Thre…
## 2 Climate change Mina… Known High Low High Curr… Cont… Exte…
## 3 Alteration of shoreline or wa… Cobe… Like… Low Very… Low Anti… Recu… Narr…
## 4 Mineral exploration or extrac… Mina… Unli… Unkn… Low Unkn… Anti… Cont… Rest…
## 5 Large-scale tidal energy extr… Mina… Unli… Unkn… Medi… Unkn… Anti… Cont… Broad
## 6 Other sources of pollution, i… Mina… Remo… Unkn… Low Unkn… Anti… Sing… Narr…
## 7 Release of petroleum products… Mina… Remo… Extr… Medi… Low Anti… Recu… Exte…
library(kableExtra)
tbls %>% kbl(caption = "Table 4") %>%
kable_classic(full_width = F, html_font = "Cambria")
|
Camelot Python PackageCamelot is a python package, installable from Pip that can extract tables from a PDF based on the lines between cells (Lattice). This helps improve the quality of the parsed data.
from camelot import read_pdf
doc_file_path = "Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.pdf"
df_list = read_pdf(doc_file_path, pages="20", line_scale=50)
# this is a list, because there might be more than one table on the page
# print everything:
print(df_list[0].df.to_string())
## 0 1 2 3 4 5 6 7 8
## 0 Threat Geographic scale Likelihood of \noccurrence Level of \nimpact Causal \ncertainty Threat \nrisk Threat \noccurrence Threat \nfrequency Threat \nextent
## 1 Climate change Minas Basin/ \nCobequid Bay Known High Low High Current Continuous Extensive
## 2 Alteration of \nshoreline or water \ncontrol structures Cobequid Bay (with \nno consideration for \npotential far-field \neffects) Likely Low Very Low Low Anticipatory Recurrent Narrow
## 3 Mineral exploration \nor extraction \nactivities in Minas \nBasin and nearby \nrivers Minas Basin/ \nCobequid Bay Unlikely Unknown Low Unknown Anticipatory Continuous Restricted
## 4 Large-scale tidal \nenergy extraction Minas Basin Unlikely Unknown Medium Unknown Anticipatory Continuous Broad
## 5 Other sources of \npollution, including \nnon-point source \npollution Minas Basin/ \nCobequid Bay Remote Unknown Low Unknown Anticipatory Single Narrow
## 6 Release of \npetroleum products \nin Gulf of Maine and \nBay of Fundy Minas Basin/ \nCobequid Bay Remote Extreme Medium Low Anticipatory Recurrent Extensive
There are a couple potential issues with this method. Special characters are explicitly recorded (newlines are saved as ) and items such as graphs can be mistaken as tables.
Docx Python PackageDocx is a python package, installable from Pip
(pip install python-docx) that can extract tables from word
documents.
from docx import Document
recovery_docx = Document("Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.docx")
# recovery_docx.tables is a list containing all of the tables in the docx, print the 3rd one:
for row in recovery_docx.tables[3].columns:
for cell in row.cells:
print(cell.text)
## Habitat type
## Tidal pools
## Boulders
##
## Capstone
## Rivulets
##
## Patches
## Habitat use
## In the bottom of tidal pools with water >0.5 cm deep and an area greater than a patch (>1.0 m2)
## Around the base of large cobbles or boulders where tidal scour causes sediment flushing
## Adjacent to or under resistant capstone that protects underlying mudstone against scour from ice and other materials
## In rivulets or channels where freshwater runoff or the tide causes sediment flushing
## In patches in the intertidal zone covered by water <0.5 cm deep and an area smaller than a tidal pool (<1.0 m2)
This method produces a very reliable output that is fairly easy to work with.
Demos:
Azure Form Recognizer is a cloud-based Azure Applied AI Service that analyzes forms and documents, extracts text and data, and maps field relationships as key-value pairs: https://formrecognizer.appliedai.azure.com/studio
Azure Form Recognizer could potentially allow coordination with tools withing our Microsoft Suite. It is powered by Python.
Azure Form Recognizer does not exist in our current suite of tools. We logged in a ticket with IT and they confirmed that in order to complete our request we would need you fill out a form so that software can be evaluated by IT security before installing to make sure it is safe to have it on our network. It does not seem to be an open source solution (0-500 pages are free per month, after that, there is a cost associated with using this tool https://azure.microsoft.com/en-ca/pricing/details/form-recognizer/)
We have not tested this option yet as there seems to be a price tag associated with number of pages depending on the capacity of data mining.
Demos:
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = TB]
node [shape = rectangle]
rec1 [label = 'PDF Recovery Documents']
rec2 [label = 'Create Word Recovery Document using Foxit']
rec3 [label = 'Identify key terms of tables to be extracted in the Word document']
rec4 [label = 'Test extraction of relevant tables']
rec1 -> rec2 -> rec3 -> rec4
}",
height = 500)
Word files seem to be providing a really nice output. Our next task was to explore which tables specifically will be mined. For this, we manually compared the spreadsheet information shared by the SARA Program and the source document/word files for Atlantic Mud-piddock and Blue whale.
| Column Name in Word Doc Table | Column Name in Spreadsheet |
|---|---|
| Species | Species Name |
| Threat | Not included |
| Geographic scale | Not included |
| Likelihood of occurrence | Not included |
| Level of impact | Not included |
| Causal certainty | Not included |
| Threat risk | Not included |
| Threat occurrence | Not included |
| Threat frequency | Not included |
| Threat extent | Not included |
| General description of research and management approaches | Recovery Measures/ Conservation Measures |
| Priority | Not included |
| Broad strategy | Not included |
| Threat or concern addressed | Not included |
| Column Name in Spreadsheet | Column Name in Word Doc Table |
|---|---|
| Designatable Unit (DU) | Not included |
| Taxon | Not included |
| COSEWIC Status | Not included |
| SARA Status | Not included |
| Lead Region | Not included |
| Broad Strategy | Not included |
| Document Used | Not included |
| Document Reference | Not included |
| Column name in Spreadsheet Tab Recovery Measure | Column name in Spreadsheet Tab Recovery Measure Tracking |
|---|---|
| Recovery Measures/ Conservation Measures | AP: Recovery Measures |
We tested the process of adding a potential new table in the Recovery Documents/Action Plans that could be added in future word files with information that could be mined into an authoritative spreadsheet.
Proposed new table
We also modified and standardized names of the additional columns to be mined from the word documents into the spreadsheets. These are tentative names until SARA Program decides which names could be used systematically in all future documents. This will make the process of mining tables significantly simpler as the codes will be looking for this unique names within tables. For this test, we did the following changes in the two word documents:
The above mentioned changes have led to the following key words to mine the two test word documents:
Note: solutions will not be case sensitive.
Mining tables from Word files proved to be the most effective solution using the least amount of coding and manipulation. We briefly tested one of the documents with the new proposed tables.
library(docxtractr)
testdocx <- read_docx("Rs-PholadeTronqueeAtlMudPiddock-v00-2022Aug-Eng.docx")
docx_tbl_count(testdocx)
## [1] 25
alltables <- docx_extract_all_tbls(testdocx)
str(alltables)
## List of 25
## $ : tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
## ..$ Species.Name : chr "Atlantic Mud-piddock"
## ..$ Designatable.Unit: chr "NA"
## ..$ Taxon : chr "Marine Mollusc"
## ..$ COSEWIC.Status : chr "TH"
## ..$ SARA.Status : chr "TH"
## ..$ Lead.Region : chr "Maritimes"
## ..$ Document.Used : chr "Recovery Strategy"
## $ : tibble [2 × 7] (S3: tbl_df/tbl/data.frame)
## ..$ Kingdom: chr [1:2] "Animalia" "Animals"
## ..$ Phylum : chr [1:2] "Mollusca" "Clams, snails, octopuses, squid, scallops, oysters, andchitons"
## ..$ Class : chr [1:2] "Bivalva" "Marine and freshwater molluscs with a two- part hingedshell"
## ..$ Order : chr [1:2] "Myida" "Burrowing salt water clams"
## ..$ Family : chr [1:2] "Pholadidae" "Piddocks"
## ..$ Genus : chr [1:2] "Barnea" "Distinguished from other Pholadidae by morphological characteristics"
## ..$ Species: chr [1:2] "Barnea truncata" "Atlantic Mud- piddock"
## $ : tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ X : chr [1:2] "Core sites" "Peripheral sites"
## ..$ Stable.sites : chr [1:2] "Tennycape, Burntcoat Head, Noel Bay, Sloop Rocks, Shad Creek, and MungoBrook" "Spencer Point and Port Williams"
## ..$ Unstable.sites: chr [1:2] "N/A" "Parrsboro, Kingsport, Evangeline Beach, Economy Point, and Five Islands"
## $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Habitat.type: chr [1:5] "Tidal pools" "Boulders" "Capstone" "Rivulets" ...
## ..$ Habitat.use : chr [1:5] "In the bottom of tidal pools with water >0.5 cm deep and an area greater than a patch (>1.0 m2)" "Around the base of large cobbles or boulders where tidal scour causes sediment flushing" "Adjacent to or under resistant capstone that protects underlying mudstone against scour from ice and other materials" "In rivulets or channels where freshwater runoff or the tide causes sediment flushing" ...
## $ : tibble [6 × 9] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr [1:6] "Climate change" "Alteration of shoreline or water control structures" "Mineral exploration or extraction activities in Minas Basin and nearby rivers" "Large-scale tidal energy extraction" ...
## ..$ Geographic.scale : chr [1:6] "Minas Basin/ Cobequid Bay" "Cobequid Bay (with no consideration for potential far-field effects)" "Minas Basin/ Cobequid Bay" "Minas Basin" ...
## ..$ Likelihood.of.occurrence: chr [1:6] "Known" "Likely" "Unlikely" "Unlikely" ...
## ..$ Level.of.impact : chr [1:6] "High" "Low" "Unknown" "Unknown" ...
## ..$ Causal.certainty : chr [1:6] "Low" "Very Low" "Low" "Medium" ...
## ..$ Threat.risk : chr [1:6] "High" "Low" "Unknown" "Unknown" ...
## ..$ Threat.occurrence : chr [1:6] "Current" "Anticipatory" "Anticipatory" "Anticipatory" ...
## ..$ Threat.frequency : chr [1:6] "Continuous" "Recurrent" "Continuous" "Continuous" ...
## ..$ Threat.extent : chr [1:6] "Extensive" "Narrow" "Restricted" "Broad" ...
## $ : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr "Recreation and adventure sport activities"
## ..$ Geographic.scale : chr "Minas Basin/ Cobequid Bay"
## ..$ Likelihood.of.occurrence: chr "Known"
## ..$ Level.of.impact : chr "Low"
## ..$ Causal.certainty : chr "Medium"
## ..$ Threat.risk : chr "Low"
## ..$ Threat.occurrence : chr "Current"
## ..$ Threat.frequency : chr "Recurrent"
## ..$ Threat.extent : chr "Restricted"
## $ : tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ X. : chr [1:10] "1" "2" "3" "4" ...
## ..$ Recovery.measures : chr [1:10] "1. Monitor, and refine knowledge of, threats to the species" "2. Research and monitor sedimentation regimes in the Minas Basin and at Atlantic Mud-piddock habitat sites, inc"| __truncated__ "3. Establish and implement a monitoring program to survey distribution of available habitat and live individual"| __truncated__ "4. Increase understanding of life history characteristics (for example, age structure, recruitment, mortality)" ...
## ..$ Priority5 : chr [1:10] "Medium" "Medium" "Medium" "Low" ...
## ..$ Broad.strategy : chr [1:10] "Research and monitoring" "Research and monitoring" "Research and monitoring" "Research and monitoring" ...
## ..$ Threat.or.concern.addressed: chr [1:10] "All threats" "Sedimentation, climate change" "Habitat availability, population abundance and distribution" "Knowledge gaps" ...
## $ : tibble [4 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ General.description.of.research.and.management.approaches: chr [1:4] "" "" "" ""
## ..$ Priority5 : chr [1:4] "" "" "" ""
## ..$ Broad.strategy : chr [1:4] "" "" "" ""
## ..$ Threat.or.concern.addressed : chr [1:4] "" "" "" ""
## $ : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ Life.stage : chr [1:3] "Egg" "Trochophore larva" "Veliger larva"
## ..$ Function6 : chr [1:3] "Egg development and growth" "Growth" "Settlement"
## ..$ Feature.s.7 : chr [1:3] "Water column in intertidal and subtidal zones" "Water column in intertidal and subtidal zones" "Water column in intertidal and subtidal zones"
## ..$ Attribute.s.8: chr [1:3] "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone9: 20.0 to29.5 pptTemperature and oxygen level"| __truncated__ "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone: 20.0 to29.5 pptTemperature and oxygen levels"| __truncated__ "Salinity in intertidal zone: 5 to 25 pptSalinity in subtidal zone: 20.0 to29.5 pptTemperature and oxygen levels"| __truncated__
## $ : tibble [5 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ Life.stage : chr [1:5] "" "Veliger larva" "Sub-adult" "Sub-adult" ...
## ..$ Function6 : chr [1:5] "" "Settlement" "Boring and growth" "Boring and growth" ...
## ..$ Feature.s.7 : chr [1:5] "" "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Water column in the intertidal zone" ...
## ..$ Attribute.s.8: chr [1:5] "High slack tide for settlement" "Available red mudstone surfaces to settle upon" "Available red mudstone of sufficient depth (~5 cm) within which to bore and matureNo rapid or significant accum"| __truncated__ "Salinity: 5 to 25 pptTemperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Su"| __truncated__ ...
## $ : tibble [4 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ Life.stage : chr [1:4] "" "Adult" "Adult" "All (except egg)"
## ..$ Function6 : chr [1:4] "" "Reproduction" "Reproduction" "Feeding"
## ..$ Feature.s.7 : chr [1:4] "Water column in the intertidal zone" "Red mudstone in the intertidal zone associated with the following habitat types: tide pools, rivulets, boulders"| __truncated__ "Water column in the intertidal zone" "Food supply"
## ..$ Attribute.s.8: chr [1:4] "Salinity: 5 to 25 pptTemperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)No"| __truncated__ "Available mudstone from which adults can release eggs and spermNo rapid or significant accumulation of sediments" "Salinity: 5 to 25 pptOxygen levels within natural range of variation (~5 mg/L)Suitable temperature for external"| __truncated__ "Sufficient quality and quantity of food (plankton and particulates)"
## $ : tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ Description.of.study: chr [1:4] "Habitat mapping via foot surveys to determine the amount of available and occupied habitat at inhabited and extirpated sites." "Habitat mapping using unmanned aerial vehicle (UAV) and high accuracy survey-grade GPS technology to identify p"| __truncated__ "Conduct research on spawning to determine timing, concurrent water temperature, and other environmental determinants." "Conduct research and modelling to refine understanding of temporal, spatial, and environmental features and att"| __truncated__
## ..$ Rationale : chr [1:4] "This will allow for comparisons of the extent of available and occupied habitat at known Atlantic Mud- piddock "| __truncated__ "This work will supplement foot surveys to inform characterization of habitat availability as either stable, inc"| __truncated__ "The habitat attributes required for spawning are unknown. This study is expected to improve understanding of ha"| __truncated__ "Little is known about the habitat requirements and movement of larvae in the Minas Basin. Modelling larval dist"| __truncated__
## ..$ Timeline10 : chr [1:4] "5 years and ongoing" "5 years and ongoing" "5 years" "5 years"
## $ : tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr "Climate change"
## ..$ Activity : chr "Global greenhouse gas emissions from human activities"
## ..$ Effect..pathway : chr "Increase in the frequency, length, and severity of storms and rainfall, causing disruption to sediment depositi"| __truncated__
## ..$ Function.s..affected : chr "Egg development and growthTrochophore larva growthVeliger larva settlementSub-adult boring and growthAdult grow"| __truncated__
## ..$ Feature.s..affected : chr "Water in intertidal and subtidal zonesRed mudstone in the intertidal zone where water is present at high and lo"| __truncated__
## ..$ Attribute.s..affected: chr "Temperature and oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Suitable temperature fo"| __truncated__
## $ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr [1:2] "Alteration of shoreline or water control structures" "Mineral exploration or extraction activities in Minas Basin and nearby rivers"
## ..$ Activity : chr [1:2] "DredgingConstruction of coastal infrastructure (for example, marina, wharf)Installation of shoreline- modifying"| __truncated__ "Dredging"
## ..$ Effect..pathway : chr [1:2] "Current alteration, causing reduced red mudstone availability or smothering through siltation events" "Disruption of sediments, causing reduced red mudstone availability or smothering through sedimentation events"
## ..$ Function.s..affected : chr [1:2] "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproduction" "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproduction"
## ..$ Feature.s..affected : chr [1:2] "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__ "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__
## ..$ Attribute.s..affected: chr [1:2] "Available red mudstone surfaces upon which to settle, and within which to bore, mature, and reproduceNo rapid o"| __truncated__ "Available red mudstone surfaces upon which to settle, and within which to bore, mature, and reproduceNo rapid o"| __truncated__
## $ : tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr [1:2] "Large-scale tidal energy extraction" "Sources of pollution, including point and non-point source pollution"
## ..$ Activity : chr [1:2] "Deployment and operation of tidal power turbines within the Minas Basin for energy extraction" "Agricultural activities leading to run- off (for example, pesticides, animal waste)Urban activities leading to "| __truncated__
## ..$ Effect..pathway : chr [1:2] "Alteration of tidal, and thus sedimentation regimes within the Minas Basin, reducing red mudstone availability "| __truncated__ "Introduction of pollutants, resulting in reduction in water quality and food availability"
## ..$ Function.s..affected : chr [1:2] "Veliger larva settlementSub-adult boring and growthAdult growthAdult reproductionLarval, sub-adult, and adult feeding" "Egg development and growthTrochophore larva growthVeliger larva settlementSub-adult boring and growthAdult grow"| __truncated__
## ..$ Feature.s..affected : chr [1:2] "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__ "Water in intertidal and subtidal zones"
## ..$ Attribute.s..affected: chr [1:2] "No rapid or significant accumulation of sedimentsAvailable red mudstone surfaces upon which to settle, and with"| __truncated__ "Oxygen levels within natural range of variation (0° to 21°C and ~5 mg/L)Sufficient water qualitySufficient qual"| __truncated__
## $ : tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
## ..$ Threat : chr "Recreation and adventure sport activities"
## ..$ Activity : chr "Running eventsBiking eventsAll terrain- vehicle use"
## ..$ Effect..pathway : chr "Frequent or intense site use resulting in degradation of substrate and reduced red mudstone availability"
## ..$ Function.s..affected : chr "Veliger larva settlementSub-adult growth and boringAdult growthAdult reproduction"
## ..$ Feature.s..affected : chr "Red mudstone in the intertidal zone where water is present at high and low tide and is associated with the foll"| __truncated__
## ..$ Attribute.s..affected: chr "Available red mudstone surfaces upon which to settle, bore, mature, and reproduceRed mudstone of sufficient dep"| __truncated__
## $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Likelihood.of.occurrence: chr [1:5] "Known or very likely to occur" "Likely to occur" "Unlikely" "Remote" ...
## ..$ Definition : chr [1:5] "This threat has been recorded to occur 91 to 100%" "There is 51 to 90% chance that this threat is or will be occurring." "There is 11 to 50% chance that this threat is or will be occurring" "There is 1 to 10% or less chance that this threat is or will be occurring." ...
## $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Level.of.impact: chr [1:5] "Extreme" "High" "Medium" "Low" ...
## ..$ Definition : chr [1:5] "Severe population decline (for example, 71 to 100%) with the potential for extirpation." "Substantial loss of population (31 to 70%) orThreat would jeopardize the survival or recovery of the population." "Moderate loss of population (11 to 30%) orThreat is likely to jeopardize the survival or recovery of the population." "Little change in population (1 to 10%) orThreat is unlikely to jeopardize the survival or recovery of the population." ...
## $ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Causal.certainty: chr [1:5] "Very high" "High" "Medium" "Low" ...
## ..$ Definition : chr [1:5] "Very strong evidence that threat is occurring and the magnitude of the impact to the population can be quantified." "Substantial evidence of a causal link between threat and population decline or jeopardy to survival or recovery" "There is some evidence linking the threat to population decline or jeopardy to survival or recovery" "There is a theoretical link with limited evidence that threat is leading to a population decline or jeopardy to"| __truncated__ ...
## $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Threat.occurrence: chr [1:3] "Historical" "Current" "Anticipatory"
## ..$ Definition : chr [1:3] "A threat that is known to have occurred in the past and negatively impacted the population." "A threat that is ongoing, and is currently negatively impacting the population." "A threat that is anticipated to occur in the future, and will negatively impact the population."
## $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Threat.frequency: chr [1:3] "Single" "Recurrent" "Continuous"
## ..$ Definition : chr [1:3] "The threat occurs once." "The threat occurs periodically, or repeatedly." "The threat occurs without interruption."
## $ : tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Threat.extent: chr [1:4] "Extensive" "Broad" "Narrow" "Restricted"
## ..$ Definition : chr [1:4] "71 to 100% of the population is affected by the threat." "31 to 70% of the population is affected by the threat." "11 to 30% of the population is affected by the threat." "1 to 10% of the population is affected by the threat."
## $ : tibble [35 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ Water.feature.name: chr [1:35] "Avon River" "Avon River" "Bass Creek (Cambridge)" "Bass Creek (Cambridge)" ...
## ..$ Latitude : chr [1:35] "45° 6' 31.324\" N" "45° 7' 7.156\" N" "45° 11' 57.748\" N" "45° 11' 57.000\" N" ...
## ..$ Longitude : chr [1:35] "64° 13' 26.971\" W" "64° 11' 55.178\" W" "64° 8' 3.999\" W" "64° 8' 2.953\" W" ...
## $ : tibble [41 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ Water.feature.name: chr [1:41] "Gr Great Village River eat Village River" "Habitant Creek" "Habitant Creek" "Harrington River" ...
## ..$ Latitude : chr [1:41] "45° 23' 40.765\" N" "45° 9' 3.393\" N" "45° 8' 41.817\" N" "45° 24' 38.152\" N" ...
## ..$ Longitude : chr [1:41] "63° 36' 37.443\" W" "64° 22' 49.369\" W" "64° 22' 47.492\" W" "64° 5' 56.519\" W" ...
## $ : tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ Water.feature.name: chr [1:24] "Portapique River 1" "Portapique River 1" "Portapique River 2" "Portapique River 2" ...
## ..$ Latitude : chr [1:24] "45° 23' 35.228\" N" "45° 23' 38.577\" N" "45° 23' 37.432\" N" "45° 23' 39.398\" N" ...
## ..$ Longitude : chr [1:24] "63° 42' 25.630\" W" "63° 42' 27.379\" W" "63° 42' 53.690\" W" "63° 42' 52.239\" W" ...
library(kableExtra)
alltables[[1]] %>% kbl(caption = "Testing Full Extraction of Tables") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Species.Name | Designatable.Unit | Taxon | COSEWIC.Status | SARA.Status | Lead.Region | Document.Used |
|---|---|---|---|---|---|---|
| Atlantic Mud-piddock | NA | Marine Mollusc | TH | TH | Maritimes | Recovery Strategy |
DFO staff is tasked with mining tables from Recovery Documents PDF/Word files to organized them in a nice and clean spreadsheet for future reporting. Below are potential future scenarios of what their experience could be using different approaches.
Staff mine tables that have weird double lines, non matching lines and background white lines separating the cells. Significant amount of manual intervention, extra coding and tweaking is required for each table in a customized fashion. This scenario is a bit cumbersome and requires extra manipulation.
Using a Reproducible Analytical Pipeline, staff mine tables directly from improved Word files provided by the authors, or generated from the authoritative PDFs using software like Foxit. Improved Word documents follow explicit guidance to create metadata (a.k.a. tables) that are mined at the same time as the filepath when scraping the document. Staff mine tables efficiently and easily because authors rigorously applied the following recommendations for each table:
Mining PDF and Word files is a possibility. Is there a better approach for mining data and storing it in an authoritative place? Data could be mined from something different than a PDF/Word file. Something such as database, excel sheets, GUI, which could be populated as part of a submission. In this case, standardized tables are not a deal breaker for mining information.
The evaluation of any scenario must take into account the time and user experience of i) folks writing/publishing SARA documents, ii) folks mining these documents, iii) Staff needed to maintain new tools.
Need a bit of cleaning before interactive report pilot - happy to do cleaning with recommendations on how to keep spreadsheets as clean as possible for coding (Excel spreadsheets, CSV, Databases)
Conceptual workflow of elements that would be required to reverse-engineer outcomes 1 and 2 (above) by using forms, csv files, or Excel spreadsheets to generate reproducible reports (e.g. using Microsoft PowerBI, R Markdown, or other tools that are easily used and accessible to Species at Risk Program staff).
Option2
Proposed Milestones:
To develop an interactive Power BI tool to play with outputs of the spreadsheets. This interactive outputs can inform reporting, and can support SARA staff explore questions on the fly, using the most up-to-date database or datafame.
To explore the feasibility of creating a Word skeleton of a SARA document that will pre-populate as much information as possible in a reproducible manner. To tackle this, we will explore options were users may not need to interact with code directly, but can take advantage of a powerful pipeline.
While there is a steep learning curve to building this pipeline, documents will be much easier to update following the addition of new data as all tables and figures should automatically update. This workflow will also improve the transparency and reproducibility of the work, especially if changes are tracked using git.
To test this idea, we will aim to create a package called SARAdown, which will generate a skeleton word file with information pre-poulated for single-species or multi-species reports. We will test this pilot using the template under development by the SARA program: https://086gc.sharepoint.com/:w:/r/sites/MaritimesSpatialPlanning-MAROpenDataLogistics/_layouts/15/Doc.aspx?action=edit&sourcedoc=%7B5a7c2d52-5121-454f-84d1-947f37bf0658%7D&wdOrigin=TEAMS-ELECTRON.teamsSdk.openFilePreview&wdExp=TEAMS-CONTROL
This project will draw ideas and code from the following packages: