Extracting lat longs from published articles

Introduction

Ecological field studies usually report site location or range.

This note explains how to programmatically extract lat-longs and elevations from pdf publications for further processing (e.g. mapping).

One of the challenges is that there is no consistent reporting pattern, so the patterns used here are based on an analysis of 45 documents that form part of the herbivory climate change review.

Method

I outline 3 steps:

  1. Reading pdfs into R
  2. Designing the code to identify lat-longs in the extracted text - this uses pattern matching with regular expressions
  3. Onward processing of the extracted lat-longs into decimal coordinates (if they are not already in decimal format).

Reading pdfs into R

There are 2 R packages widely used for reading text and pdf files into R - readtext and pdftools.

In this note I will use readtext.

I’ve already created a directory to store the pdfs I want to extract the information from.

Code
library(pacman)  ## package manager
p_load(tidyverse, readtext, sf, mapview, leaflet, leafpop) ## installs and loads packages if not already installed or loaded

It’s really easy to use readtext:

Code
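p <- path.expand("~/Downloads/herbivory-corpus")  ## point R at the pdf directory (here::here() would prepend the project root to this path)

f <- list.files(p, pattern = "\\.pdf$", full.names = TRUE)  ## get a list of pdf files with their full paths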

df <- map_dfr(f, readtext)                        ## iterate through file list, read in the text and create a data frame

df                                       ## look at first 6 rows of dataframe
readtext object consisting of 22 documents and 0 docvars.
# Description: df [22 × 2]
  doc_id                            text               
  <chr>                             <chr>              
1 1-s2.0-0038071795001867-main.pdf  "\"          \"..."
2 1-s2.0-S0016706109001487-main.pdf "\"          \"..."
3 1-s2.0-S0016706114001761-main.pdf "\"          \"..."
4 1-s2.0-S0048969714017100-main.pdf "\"          \"..."
5 1-s2.0-S0048969718320837-main.pdf "\"          \"..."
6 1-s2.0-S0048969719337271-main.pdf "\"          \"..."
# … with 16 more rows

There are 2 columns: doc_id, the file name of each document, and text, the full extracted text of that document stored in a single cell of the data frame.

We can now write some regular expressions (text patterns) to try to match lat-longs in the documents. Some candidates:

  • \\d{1,2}◦\\s*\\d{1,2}.*[NSEW] matches 20◦45’N. Note that in some pdfs what is printed as ° actually turns out to be ◦ when the pdf text is extracted

  • \\d{1,2}°\\s*\\d{1,2}.*[NSEW] matches 20°45’N.

  • \\d{1,2}\\.\\d{1,2}◦.*[NSEW] matches 20.45◦W

  • \\d{1,2}\\.\\d{1,2}°*[NSEW] matches 20.45°E

  • \\d{1,2}:\\d{1,2}:\\d{1,2}.*[NSEW] matches 20:45:11 W

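Sometimes the compass point is at the beginning rather than the end of the expression.

Regular expressions are an essential bit of coding for extracting information from text, but they can look like gibberish. Note also that .* is greedy: it runs on to the last compass letter it can find on the line, so several of these patterns can capture long stretches of surrounding text as well as the coordinate itself.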
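As a minimal sketch of how such patterns behave, the example below runs two simplified patterns over made-up strings (none of them come from the corpus). The names dm_after and dm_before, the allowance of 1 to 3 digits for degrees, and the step that swaps ◦ for ° before matching are my assumptions for illustration rather than part of the workflow in this note.

```r
library(stringr)

# Made-up example strings (not taken from the corpus)
examples <- c(
  "The site (20◦45’N, 3◦10’W) was grazed.",  # extraction gave ◦ instead of °
  "Plots were located at 20.45°E.",           # decimal degrees, no minutes
  "Coordinates: N 53°13′, W 4°00′"            # compass point comes first
)

# Swap the extraction artefact ◦ for a true degree sign so one pattern covers both
clean <- str_replace_all(examples, "◦", "°")

# Degrees and minutes followed by the compass point
dm_after <- "\\d{1,3}°\\s*\\d{1,2}[’′']?\\s*[NSEW]"

# Compass point followed by degrees and minutes
dm_before <- "[NSEW]\\s*\\d{1,3}°\\s*\\d{1,2}"

res_after  <- str_extract_all(clean, dm_after)
res_before <- str_extract_all(clean, dm_before)

res_after[[1]]   # "20°45’N" "3°10’W"
res_before[[3]]  # "N 53°13" "W 4°00"
```

Because these patterns avoid .*, they only return the coordinate itself, at the cost of missing layouts they do not anticipate (the decimal example is matched by neither).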

Elevations

For elevations a quick review of the papers suggests that there are a number of text patterns:

  • elevation

  • above sea level

  • a.s.l

Let's use these to try and extract elevations from text.

Potential text patterns are:

  • “\\d{1,}.*(elevation|above sea level|a\\.s\\.l)” finds a number with at least 1 digit followed by the words elevation OR above sea level OR a.s.l. The brackets are needed so that the number applies to all three alternatives, and the dots in a.s.l are escaped so that they match literal full stops

  • “(elevation|above sea level|a\\.s\\.l).*\\d{1,}” finds the words elevation OR above sea level OR a.s.l followed by a number with 1 or more digits
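To show why the grouping of the alternatives matters, here is a small sketch on a made-up sentence (not from the corpus). The object names ungrouped and grouped are mine; the only difference between the two patterns is the pair of brackets around the alternatives.

```r
library(stringr)

# Made-up sentence for illustration
txt <- "The field lies at 270 m a.s.l. on a north-facing slope."

# Without brackets, | splits the whole pattern into three separate alternatives:
#   "\\d{1,}.*elevation"  OR  "above sea level"  OR  "a\\.s\\.l"
ungrouped <- "\\d{1,}.*elevation|above sea level|a\\.s\\.l"

# With brackets, the number must precede whichever alternative matches
grouped <- "\\d{1,}.*(elevation|above sea level|a\\.s\\.l)"

out_ungrouped <- str_extract(txt, ungrouped)
out_grouped   <- str_extract(txt, grouped)

out_ungrouped  # "a.s.l"        (the number is lost)
out_grouped    # "270 m a.s.l"  (the number is kept)
```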

Code
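## NB: "above sea Level" (capital L) will never match lower-case text, and the
## unescaped "." after "a\\.s\\.l" matches any single character
df$text |>
  str_extract("\\d{1,}.*(elevation|a\\.s\\.l.|above sea Level).*\\d{1,}")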
 [1] NA                                                                                                  
 [2] NA                                                                                                  
 [3] NA                                                                                                  
 [4] NA                                                                                                  
 [5] NA                                                                                                  
 [6] NA                                                                                                  
 [7] NA                                                                                                  
 [8] NA                                                                                                  
 [9] NA                                                                                                  
[10] NA                                                                                                  
[11] NA                                                                                                  
[12] NA                                                                                                  
[13] NA                                                                                                  
[14] "270 m a.s.l.; 53°13′N, 4°0ʹW; Fig. S1). The field (11.5"                                            
[15] "1997; Soussana                      buffer zone and checked for comparable elevation within \00510"
[16] "2O dynamics in the soil                 elevation of 56 m a.s.l, a mean annual rainfall of 824"    
[17] "17 mm in June). Mean elevation was 567 ± 4 m and slope was 6 ± 3"                                  
[18] NA                                                                                                  
[19] NA                                                                                                  
[20] NA                                                                                                  
[21] NA                                                                                                  
[22] NA                                                                                                  

Let's now apply the lat-long patterns to our texts.

Finding lat-longs

Code
# pattern_lat <- "\\d{1,2}◦\\s*\\d{1,2}.*[NS]|\\d{1,2}\\.\\d{1,2}◦.*[NS]|\\d{1,2}\\.\\d{1,2}°*[NS]|[NS]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"
# 
# pattern_long <- "\\d{1,2}◦\\s*\\d{1,2}.*[EW]|\\d{1,2}\\.\\d{1,2}◦.*[EW]|\\d{1,2}\\.\\d{1,2}°*[EW]|[EW]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"

decimal_pattern <- "\\d{1,2}\\.\\d{1,2}◦.*[NSEW]|[NSEW]\\d{1,2}\\.\\d{1,2}◦|\\d{1,2}\\.\\d{1,2}°.*[NSEW]|[NSEW]\\d{1,2}\\.\\d{1,2}°"

colon_pattern <- "[NSEW]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"

normal_pattern <- "\\d{1,2}.*◦\\s*\\d{1,2}.*[NSEW]|\\d{1,2}.*°\\s*\\d{1,2}.*[NSEW]"

alt_pattern <- "\\d{1,2}.*\\s*\\d{1,2}.*[NSEW]"

df <- tibble(df)

x <- df |>
  mutate(dec = str_extract_all(text, decimal_pattern), 
         colon = str_extract_all(text, colon_pattern), 
         normal = str_extract_all(text, normal_pattern)) |>
  select(doc_id, dec, colon, normal)

df1 <- x |>
  hoist("normal") |>
  hoist("dec") |>
  hoist("colon") |>
  pivot_longer(names_to = "pattern", values_to = "coords", 2:4) |>
  unnest("coords") |>
  mutate(degree = ifelse(pattern == "normal", str_extract_all(coords, "\\d{1,2}?°|\\d{1,2}?◦"),
                         ifelse(pattern == "colon", str_extract(coords, "\\d{1,2}:"), 
                                str_extract_all(coords, "\\d{1,2}\\.\\d{1,2}"))),
         minutes = ifelse(pattern == "normal", str_extract_all(coords, "°\\d{1,2}|◦\\d{1,2}"),
                         ifelse(pattern == "colon", str_extract(coords, ":\\d{1,2}"), coords)),
) |>
  unnest("degree") |>
  unnest("minutes") |>
  mutate(degree = parse_number(degree),
         minutes = parse_number(minutes),
         decimal = ifelse(pattern != "dec", degree + (minutes/60), degree), 
         decimal = round(decimal, 4), 
         point = str_extract_all(coords, "[NSEW]")) |>
  unnest("point") |>
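  ## heuristic for this corpus: longitudes are below 10 and latitudes above 10;
  ## anything else (including a value of exactly 10) gets NA and is dropped
  mutate(lat_long = case_when(decimal < 10 & point %in% c("E", "W") ~ "long", 
                              decimal > 10 & point %in% c("N", "S") ~ "lat", 
                              )) |>
  drop_na() |>
  ## western longitudes become negative; southern latitudes are left positive here
  mutate(
         decimal = ifelse(point == "W", -decimal, decimal)
         ) |>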
  distinct() 
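The conversion step in the pipeline above is degrees plus minutes divided by 60. As a standalone sketch of the same arithmetic, the hypothetical helper below (dms_to_decimal is my name, not a function from any package) also accepts seconds and makes both southern and western values negative, which is an assumption that goes slightly beyond what the pipeline does.

```r
# Hypothetical helper: degrees, minutes and optional seconds to signed decimal degrees.
# Vectorised, so it can be used inside mutate().
dms_to_decimal <- function(degrees, minutes = 0, seconds = 0, hemisphere = "N") {
  decimal <- degrees + minutes / 60 + seconds / 3600
  # South and West are negative by convention
  ifelse(hemisphere %in% c("S", "W"), -decimal, decimal)
}

lat  <- dms_to_decimal(53, 13, hemisphere = "N")  # 53 + 13/60
long <- dms_to_decimal(4, 0, hemisphere = "W")    # -4

round(lat, 4)  # 53.2167
long           # -4
```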
Code
df2 <- df1 |>
  select(doc_id, coords, decimal, lat_long) |>
  pivot_wider(names_from = "lat_long", values_from = "decimal") |>
  unnest("lat") |>
  unnest("long")

df2
# A tibble: 25 × 4
   doc_id                            coords                            lat  long
   <chr>                             <chr>                           <dbl> <dbl>
 1 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W                53.4 -7.43
 2 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W                53.4 -6.9 
 3 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W                52.9 -7.43
 4 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W                52.9 -6.9 
 5 1-s2.0-S0048969714017100-main.pdf 1 estimates. In              2…  55.0 -3.03
 6 1-s2.0-S0048969714017100-main.pdf 1 estimates. In              2…  55.0 -3.58
 7 1-s2.0-S0048969714017100-main.pdf 1 estimates. In              2…  55.6 -3.03
 8 1-s2.0-S0048969714017100-main.pdf 1 estimates. In              2…  55.6 -3.58
 9 1-s2.0-S0048969719337271-main.pdf 2011) and emissions associated…  53.4 -3.37
10 1-s2.0-S0048969719337271-main.pdf 2011) and emissions associated…  53.4 -4.58
# … with 15 more rows

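The result still needs filtering: some coords values are long stretches of ordinary text rather than coordinates, and each document gets every combination of its extracted lat and long values. I am not sure yet how to do this programmatically.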
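One possible starting point, offered as a sketch rather than a solution, is to keep only rows whose coords string is a clean degrees-and-minutes pair from start to end. The toy tibble below stands in for df2 with invented values, and strict_pair is a pattern I am assuming for illustration; it would need extending to cover decimal and colon formats, and it does not deal with the repeated lat/long combinations.

```r
library(dplyr)
library(stringr)

# Toy stand-in for df2 (values invented for illustration)
toy <- tibble(
  doc_id = c("a.pdf", "a.pdf", "b.pdf"),
  coords = c("52°86' N, 6°54' W",
             "52°86' N, 6°54' W",
             "1 estimates. In 2011 the farm at 55°N"),
  lat    = c(53.4, 52.9, 55.0),
  long   = c(-7.43, -6.9, -3.03)
)

# The whole string must be: degrees° minutes N/S, degrees° minutes E/W
strict_pair <- "^\\d{1,3}°\\s*\\d{1,2}['’′]?\\s*[NS],?\\s*\\d{1,3}°\\s*\\d{1,2}['’′]?\\s*[EW]$"

filtered <- toy |>
  filter(str_detect(coords, strict_pair))

filtered  # keeps the two a.pdf rows and drops the free-text match from b.pdf
```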

Let's create a map. We’ll use the simple features package (sf).

Code
library(sf)
library(ggspatial)  ## loaded here but not used in the mapview call below

df_3 <- st_as_sf(df2, coords = c(x = "long",  y ="lat"), crs = 4326)

df_3 |>
  mapview::mapview(popup = popupTable(df_3, zcol = c("doc_id")))