This process is not for the faint-hearted, but it gives a rapid way of extracting taxa from texts. That makes it possible to process large numbers of texts to identify similar species in research, or to facilitate coding for systematic maps. It only needs to be run through once. It uses a model that is state-of-the-art in terms of accuracy to extract the information (Le Guillarme 2022; Ushey, Allaire, and Tang 2022; Le Guillarme and Thuiller 2021).
It uses Python language models, so we need to install Python and use the reticulate package in R, which allows R and Python to talk to each other.
It also downloads large files, so setup can take quite a long time.
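As a rough sketch of the Python side of the setup, the snippet below installs reticulate if it is missing and reports which Python it would use. `install.packages()`, `reticulate::install_miniconda()`, `reticulate::miniconda_path()` and `reticulate::py_discover_config()` are existing functions, but the helper names `reticulate_ready()` and `setup_python()` are hypothetical. Using a reticulate-managed Miniconda rather than an existing Python installation is an assumption here, not a requirement of this workflow.

```r
# Hypothetical helper: TRUE if the reticulate package is already installed.
reticulate_ready <- function() {
  requireNamespace("reticulate", quietly = TRUE)
}

# Hypothetical helper: install reticulate and, optionally, a Miniconda Python.
# Nothing is installed until you call setup_python() yourself.
setup_python <- function(use_miniconda = TRUE) {
  if (!reticulate_ready()) {
    install.packages("reticulate")
  }
  # Assumption: a reticulate-managed Miniconda is acceptable (large download).
  # Skip the install if the Miniconda folder already exists.
  if (use_miniconda && !dir.exists(reticulate::miniconda_path())) {
    reticulate::install_miniconda()
  }
  # Report the Python installation that reticulate would bind to.
  reticulate::py_discover_config()
}
```

Calling `setup_python()` once in a fresh R session should be enough. `reticulate_ready()` can be used afterwards to confirm the package is in place.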
If your computer has an NVIDIA graphics card, you should be able to run CUDA, which lets you use the GPU (graphics processing unit) for processing. This is much quicker than the CPU and can speed up resource-intensive tasks like natural language processing.
To check:
Open Windows Device Manager.
Scroll down to Display Adapters and expand it to see what kind of graphics card you have.
Check your card against this list. If it is there, you can use CUDA.
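The same check can be roughed out from R. This sketch assumes the NVIDIA driver has put its `nvidia-smi` utility on the system PATH, which is common with recent drivers but not guaranteed. The helper name `has_nvidia_smi()` is hypothetical. A positive result only shows that a driver and card are present. It does not confirm that the card supports CUDA.

```r
# Hypothetical helper: look for the NVIDIA driver utility on the system PATH.
# Sys.which() returns "" when the program cannot be found.
has_nvidia_smi <- function() {
  unname(nzchar(Sys.which("nvidia-smi")))
}

if (has_nvidia_smi()) {
  # "nvidia-smi -L" lists the detected GPUs, one per line.
  print(system2("nvidia-smi", "-L", stdout = TRUE))
} else {
  message("nvidia-smi not found; check Device Manager instead.")
}
```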
linker <- "ncbi_lite"
ner <- init.taxonerd("en_ner_eco_biobert", abbrev = TRUE, sent = FALSE, link = linker, thresh = .7, gpu = TRUE)
Get text to process
taxonerd offers functions for extracting taxa from blocks of text, files, or corpora (sets of files) but the file and corpora options don’t seem to be functioning properly.
To extract the text of each PDF, I use readtext as follows:
# set path to location of pdfs
path <- here::here("G:\\My Drive\\my_corpus\\uploads")
pdfs <- list.files(path, "pdf$", full.names = TRUE)
pdf_text <- purrr::map_dfr(pdfs, readtext::readtext)
PDF error: Invalid shared object hint table offset
PDF error: xref num 1590 not found but needed, try to reconstruct
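The messages above appear to be warnings from the PDF parser about malformed files rather than fatal errors, but a badly damaged file could still stop the whole batch. A minimal sketch of a more defensive read, which handles each file separately and skips any that fail, might look like this. The helper name `read_pdfs_safely()` and its `reader` argument are hypothetical. It also assumes every file returns a data frame with the same columns, so that `rbind()` can stack them.

```r
# Hypothetical helper: read each file on its own and skip any that fail.
# `reader` defaults to readtext::readtext, the function used above.
read_pdfs_safely <- function(files, reader = readtext::readtext) {
  results <- lapply(files, function(f) {
    tryCatch(
      reader(f),
      error = function(e) {
        message("Skipping ", basename(f), ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop the failures, then stack the remaining data frames.
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) {
    return(data.frame())
  }
  do.call(rbind, results)
}

# Usage with the file list built earlier:
# pdf_text <- read_pdfs_safely(pdfs)
```

Passing the reader as an argument keeps the helper independent of any one PDF package, so a different reader can be swapped in if readtext struggles with a particular corpus.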
Le Guillarme, Nicolas. 2022. “Taxonerd: Taxonomic Named Entity Recognition Using Deep Models.”
Le Guillarme, Nicolas, and Wilfried Thuiller. 2021. “TaxoNERD: Deep Neural Models for the Recognition of Taxonomic Entities in the Ecological and Evolutionary Literature.” Methods in Ecology and Evolution 13 (3): 625–41. https://doi.org/10.1111/2041-210x.13778.