library(pacman) ## package manager
p_load(tidyverse, readtext, sf, mapview, leaflet, leafpop) ## installs and loads packages if not already installed or loaded
Ecological field studies usually report site location or range.
This note explains how to programmatically extract lat-longs and elevations from PDF publications for further processing (e.g. mapping).
One of the challenges is that there is no consistent reporting pattern, so this note is based on an analysis of 45 documents which form part of the herbivory climate change review.
I outline 3 steps:
There are 2 R packages widely used for reading text and PDF files into R: readtext and pdftools.
In this note I will use readtext.
I’ve already created a directory to store the PDFs I want to extract the information from.
It’s really easy to use readtext:
p <- here::here("~/Downloads/herbivory-corpus") ## point R at the pdf directory
f <- list.files(p, "pdf", full.names = TRUE) ## get a list of files
df <- map_dfr(f, readtext) ## iterate through file list, read in the text and create a data frame
df ## look at first 6 rows of dataframe df
readtext object consisting of 22 documents and 0 docvars.
# Description: df [22 × 2]
doc_id text
<chr> <chr>
1 1-s2.0-0038071795001867-main.pdf "\" \"..."
2 1-s2.0-S0016706109001487-main.pdf "\" \"..."
3 1-s2.0-S0016706114001761-main.pdf "\" \"..."
4 1-s2.0-S0048969714017100-main.pdf "\" \"..."
5 1-s2.0-S0048969718320837-main.pdf "\" \"..."
6 1-s2.0-S0048969719337271-main.pdf "\" \"..."
# … with 16 more rows
There are 2 columns: the file name of the document, and its text, which is effectively stored in a single cell of the data frame.
We can now generate a piece of text (regular expression) to try and match lat-longs in the documents.
\\d{1,2}◦\\s*\\d{1,2}.*[NSEW] matches 20◦45’N. Note that in some PDFs what is printed as ° turns out to be ◦ once the text is extracted.
\\d{1,2}°\\s*\\d{1,2}.*[NSEW] matches 20°45’N.
\\d{1,2}\\.\\d{1,2}◦.*[NSEW] matches 20.45◦W
\\d{1,2}\\.\\d{1,2}°*[NSEW] matches 20.45°E
\\d{1,2}:\\d{1,2}:\\d{1,2}.*[NSEW] matches 20:45:11 W
Sometimes the compass point is at the beginning rather than the end of the expression.
Regular expressions are an essential bit of coding for extracting information from text, but they look like gibberish.
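Because these patterns are hard to read, it helps to test them against the example strings before running them over whole documents. The following is a minimal sketch using stringr. The pattern names and the `deg` helper are my own labels, and the patterns are slightly tightened variants of the ones listed above (a lazy `.*?` and a single character class covering both ° and ◦), so treat them as assumptions rather than the patterns used later in this note.

```r
library(stringr)

# Degree-like symbols: U+00B0 (degree sign) and U+25E6 (the bullet that
# sometimes replaces it in text extracted from PDFs).
deg <- "[\u00b0\u25e6]"

# Hypothetical pattern names; each mirrors one of the notations listed above.
patterns <- c(
  deg_min = paste0("\\d{1,2}", deg, "\\s*\\d{1,2}.*?[NSEW]"),
  decimal = paste0("\\d{1,2}\\.\\d{1,2}", deg, "\\s*[NSEW]"),
  colon   = "\\d{1,2}:\\d{1,2}:\\d{1,2}\\s*[NSEW]"
)

# The example strings from the list above, written with Unicode escapes.
examples <- c(
  "20\u25e645\u2019N", "20\u00b045\u2019N",
  "20.45\u25e6W", "20.45\u00b0E",
  "20:45:11 W"
)

# Rows are example strings, columns are patterns; TRUE means the pattern matches.
hits <- sapply(patterns, function(p) str_detect(examples, p))
rownames(hits) <- examples
hits
```

Each example should be matched by exactly one of the three patterns. A greedy `.*` would instead run on to the last N, S, E or W on the same line, which is one reason matched strings can end up much longer than the coordinate itself.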
For elevations, a quick review of the papers suggests that there are a number of text patterns:
elevation
above sea level
a.s.l
Let's use these to try and extract elevations from the text.
Potential text patterns are:
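"\\d{1,}.*(elevation|above sea level|a\\.s\\.l)" finds a number with at least 1 digit followed by the words elevation OR above sea level OR a.s.l. The parentheses are needed so that the number is required for all three alternatives, and the dots in a.s.l are escaped so they match literal full stops.
"(elevation|above sea level|a\\.s\\.l).*\\d{1,}" finds the words elevation OR above sea level OR a.s.l followed by a number with 1 or more digits.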
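Alternation (`|`) binds more loosely than everything else in a regular expression, so whether the alternatives are wrapped in parentheses changes what gets matched. Here is a small sketch on three made-up sentences (the strings and variable names are hypothetical, chosen only to show the difference).

```r
library(stringr)

# Made-up test sentences, not taken from the corpus.
txt <- c(
  "site at 270 m a.s.l.",
  "grazed plots above sea level",
  "elevation of 56 m"
)

# Without parentheses the number is only required for the first alternative.
ungrouped <- "\\d{1,}.*elevation|above sea level|a\\.s\\.l"
# With parentheses the number is required before any of the three keywords.
grouped   <- "\\d{1,}.*(elevation|above sea level|a\\.s\\.l)"
# Keyword first, number afterwards.
reversed  <- "(elevation|above sea level|a\\.s\\.l).*\\d{1,}"

res <- data.frame(
  txt       = txt,
  ungrouped = str_detect(txt, ungrouped),
  grouped   = str_detect(txt, grouped),
  reversed  = str_detect(txt, reversed)
)
res
```

The second sentence contains no number at all, yet the ungrouped pattern still matches it, while the grouped one does not. The third sentence is only picked up by the keyword-first pattern.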
df$text |>
  str_extract("\\d{1,}.*(elevation|a\\.s\\.l.|above sea Level).*\\d{1,}")
[1] NA
[2] NA
[3] NA
[4] NA
[5] NA
[6] NA
[7] NA
[8] NA
[9] NA
[10] NA
[11] NA
[12] NA
[13] NA
[14] "270 m a.s.l.; 53°13′N, 4°0ʹW; Fig. S1). The field (11.5"
[15] "1997; Soussana buffer zone and checked for comparable elevation within \00510"
[16] "2O dynamics in the soil elevation of 56 m a.s.l, a mean annual rainfall of 824"
[17] "17 mm in June). Mean elevation was 567 ± 4 m and slope was 6 ± 3"
[18] NA
[19] NA
[20] NA
[21] NA
[22] NA
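The matches above still need to be turned into numbers. One possible way, sketched below under my own assumptions, is to use tighter patterns with a capture group around the number. The two patterns (`num_before`, `num_after`) and the list of linking words are hypothetical and only cover the phrasings seen in these three matches.

```r
library(stringr)

# Three of the matched strings shown above, written with Unicode escapes.
hits <- c(
  "270 m a.s.l.; 53\u00b013\u2032N, 4\u00b00\u02b9W; Fig. S1). The field (11.5",
  "2O dynamics in the soil elevation of 56 m a.s.l, a mean annual rainfall of 824",
  "17 mm in June). Mean elevation was 567 \u00b1 4 m and slope was 6 \u00b1 3"
)

# "<number> m a.s.l" or "<number> m above sea level"
num_before <- "(\\d+(?:\\.\\d+)?)\\s*m\\s+(?:a\\.s\\.l|above sea level)"
# "elevation was|of|is <number>"
num_after  <- "elevation\\s+(?:was|of|is|:)?\\s*(\\d+(?:\\.\\d+)?)"

# str_match() returns the full match in column 1 and the capture group in column 2.
before <- str_match(hits, regex(num_before, ignore_case = TRUE))[, 2]
after  <- str_match(hits, regex(num_after,  ignore_case = TRUE))[, 2]

# Prefer the "<number> m a.s.l" form, fall back to "elevation was <number>".
elevation_m <- as.numeric(ifelse(is.na(before), after, before))
elevation_m
```

This assumes elevations are reported in metres; papers reporting feet or ranges would need extra handling.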
Let's apply these patterns to our texts.
# pattern_lat <- "\\d{1,2}◦\\s*\\d{1,2}.*[NS]|\\d{1,2}\\.\\d{1,2}◦.*[NS]|\\d{1,2}\\.\\d{1,2}°*[NS]|[NS]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"
#
# pattern_long <- "\\d{1,2}◦\\s*\\d{1,2}.*[EW]|\\d{1,2}\\.\\d{1,2}◦.*[EW]|\\d{1,2}\\.\\d{1,2}°*[EW]|[EW]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"
decimal_pattern <- "\\d{1,2}\\.\\d{1,2}◦.*[NSEW]|[NSEW]\\d{1,2}\\.\\d{1,2}◦|\\d{1,2}\\.\\d{1,2}°.*[NSEW]|[NSEW]\\d{1,2}\\.\\d{1,2}°"
colon_pattern <- "[NSEW]?\\d{1,2}:\\d{1,2}:\\d{1,2}?"
normal_pattern <- "\\d{1,2}.*◦\\s*\\d{1,2}.*[NSEW]|\\d{1,2}.*°\\s*\\d{1,2}.*[NSEW]"
alt_pattern <- "\\d{1,2}.*\\s*\\d{1,2}.*[NSEW]"
df <- tibble(df)
x <- df |>
  mutate(dec = str_extract_all(text, decimal_pattern),
         colon = str_extract_all(text, colon_pattern),
         normal = str_extract_all(text, normal_pattern)) |>
  select(doc_id, dec, colon, normal)
df1 <- x |>
  hoist("normal") |>
  hoist("dec") |>
  hoist("colon") |>
  pivot_longer(names_to = "pattern", values_to = "coords", 2:4) |>
  unnest("coords") |>
  mutate(degree = ifelse(pattern == "normal", str_extract_all(coords, "\\d{1,2}?°|\\d{1,2}?◦"),
                         ifelse(pattern == "colon", str_extract(coords, "\\d{1,2}:"),
                                str_extract_all(coords, "\\d{1,2}\\.\\d{1,2}"))),
         minutes = ifelse(pattern == "normal", str_extract_all(coords, "°\\d{1,2}|◦\\d{1,2}"),
                          ifelse(pattern == "colon", str_extract(coords, ":\\d{1,2}"), coords)),
  ) |>
  unnest("degree") |>
  unnest("minutes") |>
  mutate(degree = parse_number(degree),
         minutes = parse_number(minutes),
         decimal = ifelse(pattern != "dec", degree + (minutes/60), degree),
         decimal = round(decimal, 4),
         point = str_extract_all(coords, "[NSEW]")) |>
  unnest("point") |>
  mutate(lat_long = case_when(decimal < 10 & point %in% c("E", "W") ~ "long",
                              decimal > 10 & point %in% c("N", "S") ~ "lat",
  )) |>
  drop_na() |>
  mutate(
    decimal = ifelse(point == "W", -decimal, decimal)
  ) |>
  distinct()
df2 <- df1 |>
  select(doc_id, coords, decimal, lat_long) |>
  pivot_wider(names_from = "lat_long", values_from = "decimal") |>
  unnest("lat") |>
  unnest("long")
df2
# A tibble: 25 × 4
doc_id coords lat long
<chr> <chr> <dbl> <dbl>
1 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W 53.4 -7.43
2 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W 53.4 -6.9
3 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W 52.9 -7.43
4 1-s2.0-S0016706109001487-main.pdf 52°86' N, 6°54' W 52.9 -6.9
5 1-s2.0-S0048969714017100-main.pdf 1 estimates. In 2… 55.0 -3.03
6 1-s2.0-S0048969714017100-main.pdf 1 estimates. In 2… 55.0 -3.58
7 1-s2.0-S0048969714017100-main.pdf 1 estimates. In 2… 55.6 -3.03
8 1-s2.0-S0048969714017100-main.pdf 1 estimates. In 2… 55.6 -3.58
9 1-s2.0-S0048969719337271-main.pdf 2011) and emissions associated… 53.4 -3.37
10 1-s2.0-S0048969719337271-main.pdf 2011) and emissions associated… 53.4 -4.58
# … with 15 more rows
We need to filter these results, but I'm not sure how to do this programmatically.
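One possible approach, offered only as a sketch, is to capture latitude and longitude together as a pair so that each degree value stays with its own minutes, and then flag values that cannot be valid (minutes of 60 or more, or coordinates outside the possible range). Everything below is an assumption of mine: the `pair` pattern, the `to_decimal()` helper and the `plausible` column are hypothetical, the third example string is made up, and unlike the pipeline above this version also makes southern latitudes negative.

```r
library(stringr)

# The first two strings appear in the outputs above; the third is made up.
coords <- c(
  "52\u00b086' N, 6\u00b054' W",
  "53\u00b013\u2032N, 4\u00b00\u02b9W",
  "20\u25e645\u2019S, 30\u25e610\u2019E"
)

deg  <- "[\u00b0\u25e6]"       # degree sign, or the bullet PDFs substitute for it
gap  <- "[^0-9NSEW]{0,3}"       # prime symbols and spaces between minutes and compass point
lat  <- paste0("(\\d{1,2})", deg, "\\s*(\\d{1,2})", gap, "([NS])")
lon  <- paste0("(\\d{1,3})", deg, "\\s*(\\d{1,2})", gap, "([EW])")
pair <- paste0(lat, "[,;\\s]+", lon)

# Convert degrees and minutes to decimal degrees; south and west are negative.
to_decimal <- function(d, mins, hemi) {
  sign <- ifelse(hemi %in% c("S", "W"), -1, 1)
  sign * (as.numeric(d) + as.numeric(mins) / 60)
}

m <- str_match(coords, pair)   # column 1 is the full match, columns 2 to 7 the groups
out <- data.frame(
  coords = coords,
  lat    = round(to_decimal(m[, 2], m[, 3], m[, 4]), 4),
  long   = round(to_decimal(m[, 5], m[, 6], m[, 7]), 4)
)
# Flag rows whose minutes or ranges are impossible.
out$plausible <- as.numeric(m[, 3]) < 60 & as.numeric(m[, 6]) < 60 &
  abs(out$lat) <= 90 & abs(out$long) <= 180
out
```

The first string is flagged because 86 minutes is not a valid value, which suggests that paper may have written decimal degrees using a degree sign. Rows flagged this way would still need checking by hand.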
Let's create a map. We’ll use the simple features package (sf).
library(sf); library(ggspatial)
df_3 <- st_as_sf(df2, coords = c(x = "long", y = "lat"), crs = 4326)
df_3 |>
  mapview::mapview(popup = popupTable(df_3, zcol = c("doc_id")))