Lat longs

Julian Flowers

Setup

We’ll load the libraries we need and read the PDFs into R.

library(readtext); library(tidyverse); library(here)

source(here("R", "decimal_coords.R"))   ## load helper functions

path <- here::here("my_corpus")   ## point R at the pdf directory

f <- list.files(path, "\\.pdf$", full.names = TRUE)

## read files into a data frame

y <- map_dfr(f, readtext::readtext)


head(y)
readtext object consisting of 6 documents and 0 docvars.
# Description: df [6 × 2]
  doc_id                            text               
  <chr>                             <chr>              
1 09596830094971.pdf                "\"The Holoce\"..."
2 1-s2.0-003807179500092S-main.pdf  "\"          \"..."
3 1-s2.0-0038071795001867-main.pdf  "\"          \"..."
4 1-s2.0-026974919500009G-main.pdf  "\"          \"..."
5 1-s2.0-S001670611200119X-main.pdf "\"          \"..."
6 1-s2.0-S0016706114001761-main.pdf "\"          \"..."

Coordinate patterns

  • Trying to define generic patterns for extracting lat longs is not straightforward

    • The way lat longs are represented in literature is not standardised

    • Also, when PDFs are read additional characters can be added

    • What is printed is not necessarily what it seems

  • Some general patterns are discernible however:

    • Polar coordinates: degrees and minutes (and sometimes seconds)

      • These are generally written as one or two digits followed by a degree symbol, then one or two digits, then (not always) a ′ for minutes, possibly a ″ for seconds, then a compass point (N, S, E or W)

        • 52°07 N, 54° 20′ N, 9◦ 34 E
    • Decimal coordinates

      • This is represented as one or two digits, a decimal point, one or two more digits, a degree symbol and then a compass point

        • 5.37 W
    • Colon pattern

      • This is represented as one or two numbers (degrees) followed by a colon, then one or two numbers (minutes), then a colon, and so on.

        • W54:21:11
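To see why what is printed is not necessarily what it seems, it can help to look at the Unicode code points behind the symbols. The snippet below is a minimal sketch in base R. It assumes the ◦ in the examples above is U+25E6 (white bullet) and adds the masculine ordinal indicator as another common lookalike; the variable names are illustrative only, and the characters are written as \u escapes so the code does not depend on file encoding.

```r
## Three characters that can all look like a degree sign after PDF extraction
degree_sign  <- "\u00b0"  # DEGREE SIGN
white_bullet <- "\u25e6"  # WHITE BULLET, assumed to be the symbol in "9◦ 34 E"
ordinal_masc <- "\u00ba"  # MASCULINE ORDINAL INDICATOR

## utf8ToInt() returns the code point of a single character
code_points <- c(
  degree_sign  = utf8ToInt(degree_sign),
  white_bullet = utf8ToInt(white_bullet),
  ordinal_masc = utf8ToInt(ordinal_masc)
)
print(code_points)

## The strings differ even though they may print almost identically
same_symbol <- degree_sign == white_bullet
same_symbol
```

A pattern that only lists one of these symbols will silently miss coordinates written with the others, which is why the patterns in this document allow for more than one.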

Regular expressions

To capture these text patterns we use regular expressions (regexp)

This is a coding system designed to deal with extracting patterns from strings (e.g. email addresses, phone numbers)

Let's start with the colon pattern W54:21:11:

  • The regex for a digit is `\d`
  • To specify one or two numbers we use curly braces
    • \d{1,2} - this means minimum of 1 digit and maximum of 2
  • The whole pattern then looks like
    • \d{1,2}:\d{1,2}:\d{1,2}
  • …not quite. In an R string a backslash is a reserved character with a specific meaning, so to tell R to treat \ as a literal backslash we need to “escape” it by preceding it with the escape symbol, which is itself a backslash. We therefore write \\
  • So our colon pattern in R regex becomes (note the quotes)
    • “\\d{1,2}:\\d{1,2}:\\d{1,2}”
  • Finally, this string of numbers needs to be preceded by a compass point N, S, E or W. To specify this we enclose NSEW in square brackets - [NSEW] which gives us
    • “[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}”
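As a quick check of the steps above, the pattern can be tried on a single string before running it over the whole corpus. This is a minimal sketch: the sentence is made up for illustration and the object names are not from the original code.

```r
library(stringr)

## Hypothetical sentence, made up for illustration
sample_text <- "The site is located at N54:41:18, W2:22:45 in northern England."

## Double backslashes in the R string become single backslashes in the regex
colon_regex <- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"
writeLines(colon_regex)

## Pull out every substring that matches the pattern
colon_matches <- str_extract_all(sample_text, colon_regex)[[1]]
colon_matches
```

writeLines() prints the pattern as the regex engine sees it, which is a handy way to confirm the escaping is right.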

Matching text

Now we can match this pattern to our text using the str_match_all function in the stringr package (this is loaded when tidyverse is loaded).

## Define pattern (no capture groups yet, so each match is returned whole)
pattern <- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"

## Extract all matches
str_match_all(y$text, pattern) 
[[1]]
     [,1]

[[2]]
     [,1]

[[3]]
     [,1]

[[4]]
     [,1]

[[5]]
     [,1]

[[6]]
     [,1]

[[7]]
     [,1]

[[8]]
     [,1]

[[9]]
     [,1]

[[10]]
     [,1]

[[11]]
     [,1]

[[12]]
     [,1]

[[13]]
     [,1]

[[14]]
     [,1]

[[15]]
     [,1]

[[16]]
     [,1]

[[17]]
     [,1]

[[18]]
     [,1]

[[19]]
     [,1]

[[20]]
     [,1]

[[21]]
     [,1]

[[22]]
     [,1]

[[23]]
     [,1]

[[24]]
     [,1]

[[25]]
     [,1]

[[26]]
     [,1]

[[27]]
     [,1]

[[28]]
     [,1]

[[29]]
     [,1]

[[30]]
     [,1]

[[31]]
     [,1]

[[32]]
     [,1]

[[33]]
     [,1]

[[34]]
     [,1]

[[35]]
     [,1]

[[36]]
     [,1]

[[37]]
     [,1]

[[38]]
     [,1]

[[39]]
     [,1]

[[40]]
     [,1]

[[41]]
     [,1]

[[42]]
     [,1]

[[43]]
     [,1]

[[44]]
     [,1]

[[45]]
     [,1]

[[46]]
     [,1]

[[47]]
     [,1]

[[48]]
     [,1]

[[49]]
     [,1]       
[1,] "N54:41:18"
[2,] "W2:22:45" 

[[50]]
     [,1]

[[51]]
     [,1]

[[52]]
     [,1]

[[53]]
     [,1]

[[54]]
     [,1]

[[55]]
     [,1]

[[56]]
     [,1]

[[57]]
     [,1]

[[58]]
     [,1]

Matching text 1

We can make a minor modification to our pattern, enclosing parts of it in parentheses to create capture groups.

This gives us, for each document, a matrix which splits out the parts of the pattern, and which we can then convert into a data frame for further manipulation.

I like to use the enframe function from the tibble package, which converts a list (or vector) into a two-column data frame with one row per element.

pattern_a <- "([NSEW])(\\d{1,2}):(\\d{1,2}):(\\d{1,2})"

colon_pattern <- str_match_all(y$text, pattern_a) |>
  enframe()

colon_pattern
# A tibble: 58 × 2
    name value        
   <int> <list>       
 1     1 <chr [0 × 5]>
 2     2 <chr [0 × 5]>
 3     3 <chr [0 × 5]>
 4     4 <chr [0 × 5]>
 5     5 <chr [0 × 5]>
 6     6 <chr [0 × 5]>
 7     7 <chr [0 × 5]>
 8     8 <chr [0 × 5]>
 9     9 <chr [0 × 5]>
10    10 <chr [0 × 5]>
# … with 48 more rows
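To make the shape of the result concrete, here is a small sketch on two made-up strings rather than the PDF corpus. The object names (docs, grouped, m, m_tbl) are illustrative assumptions.

```r
library(stringr)
library(tibble)

## Two made-up "documents": one without coordinates, one with
docs <- c("No coordinates are reported here.",
          "Samples were taken at N54:41:18 and W2:22:45.")

grouped <- "([NSEW])(\\d{1,2}):(\\d{1,2}):(\\d{1,2})"

## One character matrix per document: column 1 is the full match,
## columns 2 to 5 are the four capture groups
m <- str_match_all(docs, grouped)
dim(m[[1]])   # no matches, but still 5 columns
m[[2]]

## enframe() turns the list into a tibble with columns `name` and `value`
m_tbl <- enframe(m)
m_tbl
```

Documents with no match still contribute a row holding a 0 × 5 matrix, which is why most rows in the tibble above show `<chr [0 × 5]>`.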

Matching text 2

Only one paper (49) has coordinates following this pattern. We can extract them as follows (the seconds component is not carried forward).

colon_pattern <- colon_pattern |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest_longer("value") |>
  as.matrix() |>
  data.frame() |>
  select(paper_id = name, extract = value.1, degree = value.3, minutes = value.4, point = value.2) |>
  mutate(pattern = "colon")

colon_pattern 
  paper_id   extract degree minutes point pattern
1       49 N54:41:18     54      41     N   colon
2       49  W2:22:45      2      22     W   colon
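The table above keeps degrees and minutes only. If the seconds are wanted as well, the conversion to decimal degrees could be sketched as below. The helper dms_to_decimal is hypothetical (it is not part of the original code), and it treats both South and West as negative, which is the usual convention but goes slightly beyond what is done later in this document.

```r
## Hypothetical helper: degrees, minutes, seconds and compass point
## to signed decimal degrees
dms_to_decimal <- function(degree, minutes, seconds = 0, point) {
  value <- as.numeric(degree) + as.numeric(minutes) / 60 + as.numeric(seconds) / 3600
  ## South and West are conventionally negative
  ifelse(point %in% c("S", "W"), -value, value)
}

## The two matches from paper 49, including their seconds
dms <- dms_to_decimal(c("54", "2"), c("41", "22"), c("18", "45"), c("N", "W"))
dms
```

Including the seconds shifts these two values by roughly 0.005 and 0.0125 of a degree respectively compared with the degrees-and-minutes version.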

Decimal pattern

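After a lot of trial and error a number of other patterns can be worked out.

  • Decimal - "\\d{1,2}\\.\\d{1,2}(◦|°)([NSEW])"

The version used below adds capture groups and allows one optional character either side of the degree symbol.

decimal_pattern <- "(\\d{1,2})(\\.)(\\d{1,2}).?(◦|°).?([NSEW])"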

dec_pattern <- str_match_all(y$text, decimal_pattern) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest_longer("value") |>
  as.matrix() |>
  data.frame() |>
  select(paper_id = name, extract = value.1, degree = value.2, minutes = value.4, point = value.6) |>
  mutate(pattern = "decimal")
  
dec_pattern |>
  gt::gt()
paper_id  extract    degree  minutes  point  pattern
17        54.18◦ N   54      18       N      decimal
17        2.36◦ E    2       36       E      decimal
26        52.30◦ N   52      30       N      decimal
26        6.40◦ W    6       40       W      decimal
55        10.17 °W   10      17       W      decimal
55        10.12 °W   10      12       W      decimal
55        10.17 °W   10      17       W      decimal
55        9.59 °W    9       59       W      decimal
55        9.55 °W    9       55       W      decimal
55        9.40 °W    9       40       W      decimal
55        9.43 °W    9       43       W      decimal
55        9.38 °W    9       38       W      decimal
55        51.47 °N   51      47       N      decimal
55        51.36 °N   51      36       N      decimal
55        51.58 °N   51      58       N      decimal
55        51.58 °N   51      58       N      decimal
55        51.44 °N   51      44       N      decimal
55        51.37 °N   51      37       N      decimal
55        51.35 °N   51      35       N      decimal
55        51.35 °N   51      35       N      decimal
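For decimal matches the column labelled minutes actually holds the digits after the decimal point. One way to turn the two parts back into a number, sketched here under the assumption that the parts are kept as character strings, is to paste them together rather than divide by a fixed power of ten, so that a one-digit fraction such as ".4" is read as 0.4 and not 0.04. The helper name is hypothetical.

```r
## Hypothetical helper: rebuild the number from its whole and fractional parts
decimal_parts_to_degrees <- function(whole, fraction, point) {
  value <- as.numeric(paste0(whole, ".", fraction))
  ## South and West are conventionally negative
  ifelse(point %in% c("S", "W"), -value, value)
}

## Two-digit and one-digit fractions are both handled
dec <- decimal_parts_to_degrees(c("54", "9", "9"), c("18", "40", "4"), c("N", "W", "W"))
dec
```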

Polar pattern

This is more complex because of the variation in the way polar coordinates are represented in articles.

To simplify things we can try to match lats and longs separately. A pattern that seems to work is

  • "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([NS])"

This means: find a text string with one or two digits, optionally followed by a degree symbol (the empty alternative in (◦|°|) lets the symbol be missing), then an optional space, then one or two digits, then any number of non-numeric characters, then N or S.

polar_pattern_lat <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([NS])"
polar_pattern_long <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([EW])"

## Extract latitudes
polar_lats <- str_match_all(y$text, polar_pattern_lat) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame()

## Extract longitudes
polar_longs <- str_match_all(y$text, polar_pattern_long) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame()

## Combine and select and rename fields
polar_coords <- polar_lats |>
  bind_rows(polar_longs) |>
  arrange(name) |>
  select(paper_id = name, extract = value.1, degree = value.2, minutes = value.5, point = value.7) |>
  mutate(pattern = "polar")

polar_coords |>
  gt::gt()
paper_id  extract        degree  minutes  point  pattern
5         52°07 N        52      07       N      polar
5         08°16 W        08      16       W      polar
6         52° 8′ N       52      8        N      polar
6         54° 20′ N      54      20       N      polar
6         8° 19′ W       8       19       W      polar
18        56◦ 29 N       56      29       N      polar
18        9◦ 34 E        9       34       E      polar
23        53°13′N        53      13       N      polar
23        4°0ʹW; Fig. S  4       0        S      polar
23        4°0ʹW          4       0        W      polar
24        52°18′N        52      18       N      polar
24        6°30′W         6       30       W      polar
25        50°45′N        50      45       N      polar
25        3°50′W         3       50       W      polar
30        52◦ 31’N       52      31       N      polar
30        0◦ 23’E        0       23       E      polar
35        53° 30’ N      53      30       N      polar
35        6° 10’ E       6       10       E      polar
38        55◦ 52’N       55      52       N      polar
38        03◦ 02’W       03      02       W      polar
42        51° 46'N       51      46       N      polar
42        9° 42'E        9       42       E      polar
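The row for paper 23 with the extract "4°0ʹW; Fig. S" shows a side effect of the unbounded \\D* in the latitude pattern: it can run past the real compass point until it finds a later N or S. The sketch below reproduces this on a made-up sentence and compares it with a bounded variant. The limit of three non-digit characters is an assumption chosen for illustration, not a tested recommendation, and the symbols are written as \u escapes.

```r
library(stringr)

## Made-up sentence modelled on the paper 23 match; \u escapes are used for
## the degree sign (\u00b0), prime (\u2032) and modifier prime (\u02b9)
txt <- "located at 53\u00b013\u2032N, 4\u00b00\u02b9W; Fig. S1"

## Latitude pattern from the text, with only the symbol alternation grouped
greedy  <- "\\d{1,2}(\u25e6|\u00b0|)\\s?\\d{1,2}\\D*[NS]"
## Assumption: at most three non-digit characters before the compass point
bounded <- "\\d{1,2}(\u25e6|\u00b0|)\\s?\\d{1,2}\\D{0,3}[NS]"

greedy_hits  <- str_extract_all(txt, greedy)[[1]]
bounded_hits <- str_extract_all(txt, bounded)[[1]]

greedy_hits    # two matches, the second runs on to the "S" of "Fig. S1"
bounded_hits   # only the genuine latitude
```

A tighter bound reduces false positives but may also miss coordinates with unusual spacing, so it would need checking against the corpus.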

More possible patterns

p1 <- "\\d{1,2}\\D*◦\\D?\\d{1,2}\\D*[NSEWnsew]|\\d{1,2}\\D*°\\D?\\d{1,2}\\D*[NSEW]" ## generic pattern

p2 <- "\\d{1,2}\\.\\d{1,2}.?◦.?[NSEWnsew]|\\d{1,2}\\.\\d{1,2}.?°.?[NSEWnsew]" ## decimal coordinates

p3 <- "[NSEWnsew]\\d{1,2}:\\d{1,2}:\\d{1,2}?" ## colon separated

p4 <- "\\d{4,6}\\D?[NSns].*\\d{4,6}\\D?[EWew]" ## easting / northing

p5 <- "(\\d{1,2})(◦|°|\\.||8?|:)\\s?(\\d{1,2}|\\d{1,2}′|\\d{1,2}\\00.)(\\s*)([NSEW])"

p6 <- "(\\d{1,2})(◦|°|\\.||8?|:|\\001)*[NSEW]"

p7 <- "(\\d{1,2})\\D{1,4}(\\d{1,2}0)\\D{1,3}(\\d{1,2}00)\\s*([NSEWnsew])"
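Several of the patterns above list both upper- and lower-case compass points and repeat themselves for each degree symbol. As a hedged alternative, stringr's regex() modifier can handle case, and alternatives can be joined with |. The sketch below uses two of the simpler patterns and made-up test strings; the object names are illustrative.

```r
library(stringr)

## Two of the simpler patterns, written with upper-case compass points only
colon_regex   <- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"
decimal_regex <- "\\d{1,2}\\.\\d{1,2}.?(\u25e6|\u00b0).?[NSEW]"

## Join the alternatives with | and let regex() handle upper/lower case
any_coord <- regex(paste(colon_regex, decimal_regex, sep = "|"), ignore_case = TRUE)

## Made-up test strings
tests <- c("w2:22:45", "54.18\u25e6 N", "no coordinates")
hits <- str_detect(tests, any_coord)
hits
```

Keeping a small vector of known strings like this makes it easier to see what a change to a pattern gains or loses before running it on the full corpus.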

Easting-northing

p4_lat <- "(\\d{4,6})\\D?([NS]).*(\\d{4,6})\\D?([EW])"    

e_n_pattern <- str_match_all(y$text, p4_lat) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame() 
  
## convert to decimal coordinates
enp <- e_n_pattern |>
  mutate(dec = map2(value.4, value.2, decimal_from_en)) |>
  unnest("dec")

enp
# A tibble: 5 × 8
  name  value.1                      value.2 value.3 value.4 value.5   lat  long
  <chr> <chr>                        <chr>   <chr>   <chr>   <chr>   <dbl> <dbl>
1 " 7"  "568570 N, 4835W"            568570  N       4835    W        54.9 -7.49
2 "16"  "458380 N, 28440 E"          458380  N       8440    E        53.9 -6.81
3 "31"  "1000 N, 3\u0001 540 0500 W" 1000    N       0500    W        49.9 -7.50
4 "32"  "528160 N, 88250 W"          528160  N       8250    W        54.5 -6.91
5 "54"  "8199N, 3u4490.8199W"        8199    N       8199    W        50.6 -6.49
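The decimal_from_en helper is defined in the sourced script and is not shown here. For readers without it, the general idea of converting grid references to latitude and longitude can be sketched with the sf package. This is not the author's function: the grid is an assumption (EPSG:27700, British National Grid, is used purely for illustration), and the easting and northing are made-up values.

```r
library(sf)

## Made-up easting/northing in metres. The grid is an assumption:
## EPSG:27700 (British National Grid) is used purely for illustration.
en <- data.frame(paper_id = "example", easting = 400000, northing = 500000)

en_sf <- st_as_sf(en, coords = c("easting", "northing"), crs = 27700)

## Reproject to WGS 84 (EPSG:4326) to obtain longitude/latitude in decimal degrees
ll <- st_coordinates(st_transform(en_sf, 4326))
ll   # column X is longitude, column Y is latitude
```

The result depends entirely on choosing the right source grid, so the grid used in each paper would need to be confirmed before relying on converted values.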

Putting it all together

We can combine the outputs of each stage into a final dataset, do some final cleaning and calculate decimal coordinates for use in mapping the locations.

enp1 <- enp |>
  select(paper_id = name, lat, long)

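combined <- bind_rows(colon_pattern, dec_pattern, polar_coords) |>
  ## decimal matches: "minutes" holds the fractional digits (two in all matches above), so divide by 100;
  ## other patterns: minutes are converted to a fraction of a degree
  mutate(dec_coords = ifelse(pattern == "decimal", as.numeric(degree) + as.numeric(minutes) / 100, as.numeric(degree) + as.numeric(minutes) / 60), 
         ## western longitudes are made negative (southern latitudes are not negated here)
         dec_coords = ifelse(point == "W", dec_coords * -1, dec_coords), 
         lat_lon = case_when(point %in% c("N", "S") ~ "lat", 
                             TRUE ~ "long")) |>
  select(paper_id, dec_coords, lat_lon) |>
  ## one row per paper, with list-columns of latitudes and longitudes
  pivot_wider(names_from = "lat_lon", values_from = "dec_coords") |>
  arrange(paper_id) |>
  ## unnesting both columns gives every lat/long combination within a paper
  unnest("lat") |>
  unnest("long") |>
  distinct() |>
  ## drop the spurious row where lat equals -long (the false "S" match in paper 23)
  filter(lat != -long) 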

combined <- combined |>
  bind_rows(enp1) |>
  arrange(paper_id) 

combined |>
  gt::gt()
paper_id lat long
5 52.11667 -8.2666667
6 52.13333 -8.3166667
6 54.33333 -8.3166667
7 54.88772 -7.4861446
16 53.92507 -6.8091290
17 54.18000 2.3600000
18 56.48333 9.5666667
23 53.21667 -4.0000000
24 52.30000 -6.5000000
25 50.75000 -3.8333333
26 52.30000 -6.4000000
30 52.51667 0.3833333
31 49.85964 -7.4982162
32 54.54877 -6.9117863
35 53.50000 6.1666667
38 55.86667 -3.0333333
42 51.76667 9.7000000
49 54.68333 -2.3666667
54 50.55094 -6.4909419
55 51.47000 -10.1700000
55 51.47000 -10.1200000
55 51.47000 -9.5900000
55 51.47000 -9.5500000
55 51.47000 -9.4000000
55 51.47000 -9.4300000
55 51.47000 -9.3800000
55 51.36000 -10.1700000
55 51.36000 -10.1200000
55 51.36000 -9.5900000
55 51.36000 -9.5500000
55 51.36000 -9.4000000
55 51.36000 -9.4300000
55 51.36000 -9.3800000
55 51.58000 -10.1700000
55 51.58000 -10.1200000
55 51.58000 -9.5900000
55 51.58000 -9.5500000
55 51.58000 -9.4000000
55 51.58000 -9.4300000
55 51.58000 -9.3800000
55 51.44000 -10.1700000
55 51.44000 -10.1200000
55 51.44000 -9.5900000
55 51.44000 -9.5500000
55 51.44000 -9.4000000
55 51.44000 -9.4300000
55 51.44000 -9.3800000
55 51.37000 -10.1700000
55 51.37000 -10.1200000
55 51.37000 -9.5900000
55 51.37000 -9.5500000
55 51.37000 -9.4000000
55 51.37000 -9.4300000
55 51.37000 -9.3800000
55 51.35000 -10.1700000
55 51.35000 -10.1200000
55 51.35000 -9.5900000
55 51.35000 -9.5500000
55 51.35000 -9.4000000
55 51.35000 -9.4300000
55 51.35000 -9.3800000
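For papers reporting several sites (6 and 55 above), the table contains every combination of the extracted latitudes and longitudes rather than the pairs as reported. One possible alternative, sketched below, is to pair the nth latitude with the nth longitude within each paper. This rests on the assumption that a paper lists its coordinates in matching order, which would need checking against the source text; the data and object names are illustrative.

```r
library(dplyr)
library(tidyr)

## Illustrative long-format input: the first two longitudes and latitudes of paper 55
coords <- tibble::tribble(
  ~paper_id, ~lat_lon, ~dec_coords,
  "55",      "long",   -10.17,
  "55",      "long",   -10.12,
  "55",      "lat",     51.47,
  "55",      "lat",     51.36
)

## Number the values within each paper and coordinate type,
## then spread so that values with the same number share a row
paired <- coords |>
  group_by(paper_id, lat_lon) |>
  mutate(pair = row_number()) |>
  ungroup() |>
  pivot_wider(names_from = lat_lon, values_from = dec_coords)

paired
```

With this approach a paper with n latitudes and n longitudes yields n rows instead of n × n.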

Mapping reported locations

Mapping decimal coordinates is relatively straightforward with the sf and mapview packages.

First we need to convert our lat longs into sf format

library(sf); library(mapview)

combined_sf <- st_as_sf(combined, coords = c("long", "lat"), crs = 4326)

## save as a csv file for wider use

#st_write(combined_sf, "combined_shp.csv")

combined_sf
Simple feature collection with 61 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -10.17 ymin: 49.85964 xmax: 9.7 ymax: 56.48333
Geodetic CRS:  WGS 84
# A tibble: 61 × 2
   paper_id             geometry
 * <chr>             <POINT [°]>
 1 " 5"     (-8.266667 52.11667)
 2 " 6"     (-8.316667 52.13333)
 3 " 6"     (-8.316667 54.33333)
 4 " 7"     (-7.486145 54.88772)
 5 "16"     (-6.809129 53.92507)
 6 "17"             (2.36 54.18)
 7 "18"      (9.566667 56.48333)
 8 "23"            (-4 53.21667)
 9 "24"              (-6.5 52.3)
10 "25"        (-3.833333 50.75)
# … with 51 more rows

We can then pass this object to mapview, which gives an interactive map.

mapview(combined_sf)

It is clear there are some issues:

  1. Three points are plotted in implausible locations (papers 17, 31 and 54).
  2. Some points may be missing.

The text of the anomalous papers, and of papers that appear to be missing from the map, can then be searched manually to check the coordinates.

Elevations

There are text patterns for elevations to which we can apply similar logic: generally a number (1-5 digits), possibly followed by m, followed by “elevation”, “a.s.l” or “above sea level”.

el_pattern <- "(elevation of|a\\.s\\.l|above sea level)?\\s?(\\d{1,})\\s([m]?)\\s(elevation|a\\.s\\.l|above sea level)"

elevations <- str_match_all(y$text, el_pattern) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame() |>
  select(paper_id = name, elevation_in_m = value.3 )

elevations |>
  gt::gt()
paper_id elevation_in_m
5 52
11 900
15 1
16 1040
21 40
23 270
26 67
38 190
40 190
41 340
41 160
41 200
44 60
54 140
55 187
58 56
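The extracted elevations are character strings. The sketch below shows the same idea on a made-up sentence with a simplified pattern (an assumption, not the pattern used above) and converts the result to numbers. Note that an unescaped . in a regular expression matches any character, so the dots in a.s.l are escaped here.

```r
library(stringr)

## Made-up sentence for illustration
txt <- "The site (270 m a.s.l.) receives 1040 mm of rain; a second plot lies 56 m above sea level."

## Simplified pattern (an assumption, not the one used above):
## 1-5 digits, optional space, "m", whitespace, then a sea-level phrase.
## The dots in a.s.l are escaped so they match literal full stops.
simple_el <- "(\\d{1,5})\\s?m\\s(a\\.s\\.l|above sea level)"

el <- str_match_all(txt, simple_el)[[1]]

## Column 2 holds the first capture group, i.e. the digits
elevation_m <- as.numeric(el[, 2])
elevation_m
```

The rainfall figure is skipped because "mm" is not followed by one of the sea-level phrases.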

Add this to locations

combined_sf |>
  left_join(elevations, by = "paper_id")
Simple feature collection with 61 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -10.17 ymin: 49.85964 xmax: 9.7 ymax: 56.48333
Geodetic CRS:  WGS 84
# A tibble: 61 × 3
   paper_id             geometry elevation_in_m
   <chr>             <POINT [°]> <chr>         
 1 " 5"     (-8.266667 52.11667) 52            
 2 " 6"     (-8.316667 52.13333) <NA>          
 3 " 6"     (-8.316667 54.33333) <NA>          
 4 " 7"     (-7.486145 54.88772) <NA>          
 5 "16"     (-6.809129 53.92507) 1040          
 6 "17"             (2.36 54.18) <NA>          
 7 "18"      (9.566667 56.48333) <NA>          
 8 "23"            (-4 53.21667) 270           
 9 "24"              (-6.5 52.3) <NA>          
10 "25"        (-3.833333 50.75) <NA>          
# … with 51 more rows
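In the output above paper_id is a padded character string (" 5") and elevation_in_m is also character. Before further analysis it may be worth converting both to numbers. The sketch below uses made-up stand-ins for the two tables (plain data frames rather than sf objects) and hypothetical object names.

```r
library(dplyr)

## Made-up rows mimicking the padded character ids seen above (" 5", "16")
locations <- data.frame(paper_id = c(" 5", " 6", "16"),
                        lat = c(52.1, 52.1, 53.9))
elev <- data.frame(paper_id = c(" 5", "16", "41"),
                   elevation_in_m = c("52", "1040", "340"))

## Convert ids and elevations to numbers before joining
locations_clean <- mutate(locations, paper_id = as.integer(trimws(paper_id)))
elev_clean <- mutate(elev,
                     paper_id = as.integer(trimws(paper_id)),
                     elevation_in_m = as.numeric(elevation_in_m))

## Keep every location; papers without an elevation get NA
joined <- left_join(locations_clean, elev_clean, by = "paper_id")
joined
```

Integer keys also sort in numeric order, so paper 5 comes before paper 16 without relying on padding.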