Using the srUtils package

Author

Julian Flowers

Published

September 28, 2022

Motivation

This note introduces the srUtils package, a very early iteration of a set of functions designed to assist with systematic mapping exercises in ecology.

An essential part of systematic literature review is extraction of key information in coded format.

Some information can be readily extracted using regular expressions, and functions are available for some of these cases. Other kinds of information extraction require more sophisticated analysis using natural language processing (NLP), a form of AI. Using NLP we can extract locations, chemicals and taxa. This is more complex, and functions for it are not yet available in this package, although a walk-through is available.

Functions

Currently the package has four functions, all named in the form get_x.

These are:

  • get_air_temperature - this identifies temperatures expressed in degrees Celsius (e.g. 22 °C)

  • get_chemical - this identifies simple expressions like N2O, CO2 and the term greenhouse gases (see footnote 1)

  • get_elevation - this identifies elevations, i.e. heights above sea level, in texts

  • get_coordinates - this detects latitudes and longitudes in various formats including polar and decimal coordinates and eastings and northings.
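
To make the regular-expression approach concrete, here is a minimal sketch of how a temperature extractor could work. It is not the srUtils implementation: the helper `sketch_air_temperature()`, its pattern and the example sentence are hypothetical, and the sketch assumes the stringr package (part of the tidyverse) is installed.

```r
library(stringr)

# Hypothetical helper for illustration only; not part of srUtils.
sketch_air_temperature <- function(text) {
  # Capture groups: (number) (degree sign) (unit letter C).
  # "\u00b0" is the degree sign.
  pattern <- "(-?\\d+(?:\\.\\d+)?)\\s*(\u00b0)\\s*(C)"
  # str_match_all() returns one character matrix per input string:
  # column 1 is the full match, columns 2-4 are the capture groups.
  str_match_all(text, pattern)[[1]]
}

example_text <- "Mean air temperature was 22 \u00b0C in July and 10.5\u00b0C in March."
temps <- sketch_air_temperature(example_text)
temps[, 2]  # the numeric part of each match, still stored as character
```

Returning the capture groups as separate columns keeps the number, symbol and unit available for later cleaning.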

Getting started

The package is not on CRAN, but it can be installed from GitHub with either devtools or remotes. The code below uses remotes.

Code
if(!require("remotes")) install.packages("remotes")

remotes::install_github("julianflowers/srUtils")

library(srUtils)

The tidyverse package is very useful alongside srUtils, as is readtext for reading in large numbers of text files (e.g. PDFs or Word documents).

Code
library(tidyverse, quietly = TRUE); library(readtext)

Usage

To show how to use the package, we will load some articles in PDF format.

Code
# Folder containing the PDFs (a local path on the author's machine; substitute your own)
path <- here::here("/Users/julianflowers/Library/Mobile Documents/com~apple~CloudDocs/Desktop/herbivores_ghg/my_corpus")

pdfs <- list.files(path, "pdf$", full.names = TRUE)

pdf_text <- map_dfr(pdfs, readtext)

Get elevation

Code
el <- get_elevation(pdf_text$text[5]) 

paste(pdf_text$doc_id[5], el$value[[1]][3], el$value[[1]][4])
[1] "1-s2.0-S001670611200119X-main.pdf 52 m"

Get air temperature

Code
air_temp <- get_air_temperature(pdf_text$text[5]) 

air_temp |>
  unnest("value")
# A tibble: 11 × 2
    name value[,1]                                             [,2]  [,3]  [,4] 
   <int> <chr>                                                 <chr> <chr> <chr>
 1     1 "22 °C. More than "                                   22    °     C    
 2     1 "10 °C. As grass requires a minimum temperature >"    10    °     C    
 3     1 "300 °C. Gases of known N"                            300   °     C    
 4     1 "11 °C and WFPS was "                                 11    °     C    
 5     1 "22 °C. Such environmental\nMATLAB (Math works USA, " 22    °     C    
 6     1 "30\n                                          Soil … 30    °     C    
 7     1 "17 °C to be suit-\n                                … 17    °     C    
 8     1 "20 °C) and an N application on\n                   … 20    °     C    
 9     1 "18 °C) and\n                                       … 18    °     C    
10     1 "16 °C and                           were observed b… 16    °     C    
11     1 "18 °C and there was an N application on June       … 18    °     C    
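
The 300 °C match in the output above is probably not an air temperature. The following is a minimal sketch of one way to screen such values, assuming the second matrix column holds the number as shown in the output. The matrix is typed in by hand from the first four rows of that output, and the plausible range of -60 to 60 °C is an arbitrary assumption, not something srUtils applies.

```r
# First four matches from the output above, typed in by hand.
# Column 1 is the matched text, 2 the number, 3 the degree sign, 4 the unit.
matches <- matrix(
  c("22 \u00b0C. More than ", "22", "\u00b0", "C",
    "10 \u00b0C. As grass requires a minimum temperature >", "10", "\u00b0", "C",
    "300 \u00b0C. Gases of known N", "300", "\u00b0", "C",
    "11 \u00b0C and WFPS was ", "11", "\u00b0", "C"),
  ncol = 4, byrow = TRUE
)

temps <- data.frame(
  context = matches[, 1],
  celsius = as.numeric(matches[, 2])
)

# Assumed plausible range for air temperature; adjust to suit the review.
temps$plausible <- temps$celsius >= -60 & temps$celsius <= 60

temps[temps$plausible, "celsius"]
```

Keeping the matched context next to the number makes it easier to check by eye why a value was flagged.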

Get chemical

Code
chem <- get_chemical(pdf_text$text[5])

chem |>
  unnest("value") |>
  distinct()
# A tibble: 10 × 2
    name value[,1]                                        
   <int> <chr>                                            
 1     1 "N2O"                                            
 2     1 "nitro"                                          
 3     1 "nitrog"                                         
 4     1 "NO"                                             
 5     1 "carbon"                                         
 6     1 "     Monthly N2O flux"                          
 7     1 "NO2"                                            
 8     1 "CO"                                             
 9     1 "greenhouse gases between terrestrial ecosystems"
10     1 "greenhouse effect."                             
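
Some of the matches above ("nitro", "nitrog", "NO", "CO") may be fragments of longer words or formulae. One common way to reduce this is to anchor formulae at word boundaries, shown here as a sketch and not as the package's own pattern. Both patterns and the example sentence are made up for illustration, and stringr is assumed to be installed.

```r
library(stringr)

text <- "N2O and CO2 fluxes were higher in CONTROL plots; NO was not detected."

# Without boundaries, "CO" is also found inside the word "CONTROL".
loose <- "N2O|CO2|CO|NO"
# \\b requires a word boundary on both sides of the formula.
strict <- "\\b(?:N2O|CO2|CO|NO)\\b"

loose_hits <- str_extract_all(text, loose)[[1]]
strict_hits <- str_extract_all(text, strict)[[1]]

loose_hits   # "N2O" "CO2" "CO" "NO"
strict_hits  # "N2O" "CO2" "NO"
```

Word boundaries will not help with text that is hyphenated across line breaks in the PDF, so some manual checking is still likely to be needed.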

Get coordinates

Code
coords <- get_coordinates(pdf_text$text[5])

coords |>
  unnest("value") |>
  distinct()
# A tibble: 2 × 2
   name value[,1] [,2]  [,3]  [,4]  [,5] 
  <int> <chr>     <chr> <chr> <chr> <chr>
1     1 52°07 N   52    °     07    N    
2     1 08°16 W   08    °     16    W    
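
For mapping, degree-and-minute coordinates like these usually need converting to signed decimal degrees: decimal = degrees + minutes / 60, negated for the southern and western hemispheres. The helper `dm_to_decimal()` below is a hypothetical sketch, not an srUtils function. It assumes the degrees, minutes and hemisphere have already been split out, as in columns 2, 4 and 5 of the output above.

```r
# Hypothetical helper for illustration only; not part of srUtils.
dm_to_decimal <- function(degrees, minutes, hemisphere) {
  decimal <- as.numeric(degrees) + as.numeric(minutes) / 60
  # South and west are negative by convention.
  ifelse(hemisphere %in% c("S", "W"), -decimal, decimal)
}

# The two coordinates found above: 52°07 N and 08°16 W.
decimal_coords <- dm_to_decimal(
  degrees    = c("52", "08"),
  minutes    = c("07", "16"),
  hemisphere = c("N", "W")
)
decimal_coords
```

Here 52°07 N becomes 52 + 7/60, roughly 52.117, and 08°16 W becomes -(8 + 16/60), roughly -8.267.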

Scaling up

We can apply these functions to multiple papers at the same time using the map functions from the purrr package (part of the tidyverse).

Code
coords2 <- pdf_text |>
  mutate(coords = map(text, get_coordinates))

coords3 <- coords2 |>
  unnest("coords") |>
  unnest_auto("value") 

coords3$value |>
  as.matrix() |>
  data.frame() 
                         X1     X2 X3   X4 X5
1                   52°07 N     52  °   07  N
2                   08°16 W     08  °   16  W
3                  52° 8′ N     52  °    8  N
4                 54° 20′ N     54  °   20  N
5                  8° 19′ W      8  °   19  W
6           568570 N, 4835W 568570  N 4835  W
7         458380 N, 28440 E 458380  N 8440  E
8                  54.18◦ N     54 18    ◦  N
9                   2.36◦ E      2 36    ◦  E
10             56◦ 29\004 N     56  ◦   29  N
11              9◦ 34\004 E      9  ◦   34  E
12                  53°13′N     53  °   13  N
13            4°0ʹW; Fig. S      4  °    0  S
14                    4°0ʹW      4  °    0  W
15                  52°18′N     52  °   18  N
16                   6°30′W      6  °   30  W
17                  50°45′N     50  °   45  N
18                   3°50′W      3  °   50  W
19                 52.30◦ N     52 30    ◦  N
20                  6.40◦ W      6 40    ◦  W
21                 52◦ 31’N     52  ◦   31  N
22                  0◦ 23’E      0  ◦   23  E
23 1000 N, 3\001 540 0500 W   1000  N 0500  W
24        528160 N, 88250 W 528160  N 8250  W
25                53° 30’ N     53  °   30  N
26                 6° 10’ E      6  °   10  E
27                 55◦ 52’N     55  ◦   52  N
28                 03◦ 02’W     03  ◦   02  W
29                 51° 46'N     51  °   46  N
30                  9° 42'E      9  °   42  E
31                N54:41:18      N 54   41 18
32                 W2:22:45      W  2   22 45
33      8199N, 3u4490.8199W   8199  N 8199  W
34                 10.17 °W     10 17    °  W
35                 10.12 °W     10 12    °  W
36                 10.17 °W     10 17    °  W
37                  9.59 °W      9 59    °  W
38                  9.55 °W      9 55    °  W
39                  9.40 °W      9 40    °  W
40                  9.43 °W      9 43    °  W
41                  9.38 °W      9 38    °  W
42                 51.47 °N     51 47    °  N
43                 51.36 °N     51 36    °  N
44                 51.58 °N     51 58    °  N
45                 51.58 °N     51 58    °  N
46                 51.44 °N     51 44    °  N
47                 51.37 °N     51 37    °  N
48                 51.35 °N     51 35    °  N
49                 51.35 °N     51 35    °  N
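
When mapping over many documents, a single error would stop the whole run. The following is a hedged sketch of one way to guard against that with `purrr::possibly()`. The extractor `fragile_extract()` is a made-up stand-in for any function that might fail on some inputs, and the three input strings are invented. Nothing here implies that the get_x functions themselves raise errors.

```r
library(purrr)

# Made-up stand-in for an extractor that fails on some inputs.
fragile_extract <- function(text) {
  if (is.na(text)) stop("no text available")
  nchar(text)
}

# possibly() wraps a function so that it returns `otherwise` instead of erroring.
safe_extract <- possibly(fragile_extract, otherwise = NA_integer_)

texts <- c("52 m above sea level", NA, "18 \u00b0C")
results <- map_int(texts, safe_extract)
results
```

The failed document shows up as NA in the results, so it can be inspected afterwards without rerunning the whole batch.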

Footnotes

  1. More comprehensive and accurate detection of chemicals can be performed using HunFlair, a language model which is part of the Flair Python package. There is not yet an easy-to-use R implementation for this, but a tutorial is available.