Lat longs

Julian Flowers

Setup

We’ll load the libraries we need and read the PDFs into R.

library(readtext); library(tidyverse); library(here)

source(here("R", "decimal_coords.R"))

path <- here("my_corpus")   ## point R at the pdf directory

f <- list.files(path, "pdf$", full.names = TRUE)

## read files into a data frame

y <- map_dfr(f, readtext::readtext)


head(y)

readtext object consisting of 6 documents and 0 docvars.
# Description: df [6 × 2]
  doc_id                            text               
  <chr>                             <chr>              
1 09596830094971.pdf                "\"The Holoce\"..."
2 1-s2.0-003807179500092S-main.pdf  "\"          \"..."
3 1-s2.0-0038071795001867-main.pdf  "\"          \"..."
4 1-s2.0-026974919500009G-main.pdf  "\"          \"..."
5 1-s2.0-S001670611200119X-main.pdf "\"          \"..."
6 1-s2.0-S0016706114001761-main.pdf "\"          \"..."

Coordinate patterns

Defining generic patterns for extracting lat longs is not straightforward:
- The way lat longs are represented in the literature is not standardised
- When PDFs are read into R, additional characters can be added
- What is printed is not necessarily what is stored: characters that look alike on the page (for example ° and ◦) can be different characters in the extracted text
Some general patterns are discernible, however:
- Polar coordinates: degrees and minutes (and sometimes seconds)
  - This is generally represented as one or two numbers followed by a degree symbol, followed by one or two numbers, then (not always) a ’ (for minutes), possibly ’’ (for seconds), then a compass point (NSEW)
    - 52°07 N, 54° 20′ N, 9◦ 34 E
- Decimal coordinates
  - This is represented as one or two numbers, a decimal point, one or two further numbers, a degree symbol, then a compass point
    - 5.37 W
- Colon pattern
  - This is represented as one or two numbers (degrees) followed by a colon, then one or two numbers (minutes), then a colon, and so on.
    - W54:21:11
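The point that printed characters are not always what they seem can be checked directly. The sketch below assumes the two degree-like symbols in the examples are the degree sign (U+00B0) and the white bullet (U+25E6); that is a common substitution in PDF text, but it is an assumption about these particular files.

```r
## Two symbols that look alike in print (the code points are an assumption about the PDFs)
degree_sign  <- "\u00b0"   # DEGREE SIGN
white_bullet <- "\u25e6"   # WHITE BULLET

## They are different characters, so a pattern has to allow for both
identical(degree_sign, white_bullet)   # FALSE
utf8ToInt(degree_sign)                 # 176
utf8ToInt(white_bullet)                # 9702

## A literal search for one does not find the other
found <- grepl(degree_sign, "9\u25e6 34 E", fixed = TRUE)
found                                  # FALSE
```

This is why the patterns later in the document use an alternation such as (◦|°) rather than a single symbol.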

Regular expressions

To capture these text patterns we use regular expressions (regex).

This is a coding system designed for extracting patterns from strings (e.g. email addresses, phone numbers).

Let's start with the colon pattern W54:21:11:

The regex for a digit is `\d`.
To specify one or two digits we use curly braces:
- \d{1,2} - this means a minimum of 1 digit and a maximum of 2
The whole pattern then looks like
- \d{1,2}:\d{1,2}:\d{1,2}
…but not quite. In an R string a backslash is a reserved character with a specific meaning, so to tell R to treat \ as a literal backslash we need to “escape” it by preceding it with the escape symbol, which is itself a backslash. So we write \\
So our colon pattern in R regex becomes (note the quotes)
- "\\d{1,2}:\\d{1,2}:\\d{1,2}"
Finally, this string of numbers needs to be preceded by a compass point: N, S, E or W. To specify this we enclose NSEW in square brackets, [NSEW], which gives us
- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"
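Before running the pattern over the whole corpus, it can help to try it on a few short strings. The test strings below are made up for illustration; only the first comes from the text above.

```r
library(stringr)

pattern <- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"

## writeLines() prints the pattern as the regex engine receives it (single backslashes)
writeLines(pattern)

## Only strings with a compass point followed by three colon-separated numbers match
samples <- c("W54:21:11", "54:21:11", "site at N5:3:1")
hits <- str_detect(samples, pattern)
hits   # TRUE FALSE TRUE
```

The second string fails because it has no leading compass point.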

Matching text

Now we can match this pattern to our text using the str_match_all function in the stringr package (this is loaded when tidyverse is loaded).

## Define pattern (no capture groups yet, so only the whole match is returned)
pattern <- "[NSEW]\\d{1,2}:\\d{1,2}:\\d{1,2}"

## Extract all matches
str_match_all(y$text, pattern)

[[1]]
     [,1]

[[2]]
     [,1]

[[3]]
     [,1]

[[4]]
     [,1]

[[5]]
     [,1]

[[6]]
     [,1]

[[7]]
     [,1]

[[8]]
     [,1]

[[9]]
     [,1]

[[10]]
     [,1]

[[11]]
     [,1]

[[12]]
     [,1]

[[13]]
     [,1]

[[14]]
     [,1]

[[15]]
     [,1]

[[16]]
     [,1]

[[17]]
     [,1]

[[18]]
     [,1]

[[19]]
     [,1]

[[20]]
     [,1]

[[21]]
     [,1]

[[22]]
     [,1]

[[23]]
     [,1]

[[24]]
     [,1]

[[25]]
     [,1]

[[26]]
     [,1]

[[27]]
     [,1]

[[28]]
     [,1]

[[29]]
     [,1]

[[30]]
     [,1]

[[31]]
     [,1]

[[32]]
     [,1]

[[33]]
     [,1]

[[34]]
     [,1]

[[35]]
     [,1]

[[36]]
     [,1]

[[37]]
     [,1]

[[38]]
     [,1]

[[39]]
     [,1]

[[40]]
     [,1]

[[41]]
     [,1]

[[42]]
     [,1]

[[43]]
     [,1]

[[44]]
     [,1]

[[45]]
     [,1]

[[46]]
     [,1]

[[47]]
     [,1]

[[48]]
     [,1]

[[49]]
     [,1]       
[1,] "N54:41:18"
[2,] "W2:22:45" 

[[50]]
     [,1]

[[51]]
     [,1]

[[52]]
     [,1]

[[53]]
     [,1]

[[54]]
     [,1]

[[55]]
     [,1]

[[56]]
     [,1]

[[57]]
     [,1]

[[58]]
     [,1]

Matching text 1

We can make a minor modification to our pattern by creating capture groups by enclosing parts of the pattern in parentheses.

This gives us a matrix which splits out parts of the pattern which we can then convert into a data frame for further manipulation.

I like to use the enframe function from the tibble package, which converts a vector or list into a two-column data frame (name and value).

pattern_a <- "([NSEW])(\\d{1,2}):(\\d{1,2}):(\\d{1,2})"

colon_pattern <- str_match_all(y$text, pattern_a) |>
  enframe()

colon_pattern

# A tibble: 58 × 2
    name value        
   <int> <list>       
 1     1 <chr [0 × 5]>
 2     2 <chr [0 × 5]>
 3     3 <chr [0 × 5]>
 4     4 <chr [0 × 5]>
 5     5 <chr [0 × 5]>
 6     6 <chr [0 × 5]>
 7     7 <chr [0 × 5]>
 8     8 <chr [0 × 5]>
 9     9 <chr [0 × 5]>
10    10 <chr [0 × 5]>
# … with 48 more rows
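To see what the capture groups return, here is a small sketch on a single made-up sentence containing the two extracts found in paper 49. The sentence itself is hypothetical; the pattern is the one defined above.

```r
library(stringr)
library(tibble)

pattern_a <- "([NSEW])(\\d{1,2}):(\\d{1,2}):(\\d{1,2})"

## A made-up sentence standing in for the text of one paper
txt <- "The site lies at N54:41:18, W2:22:45."

m <- str_match_all(txt, pattern_a)

## One matrix per input string: column 1 is the full match,
## columns 2 to 5 are compass point, degrees, minutes and seconds
m[[1]]

## enframe() gives one row per input string, with the matrix in a list column
one_row <- enframe(m)
one_row
```

Each empty `<chr [0 × 5]>` entry in the tibble above is a paper with no match: zero rows, but still five columns.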

Matching text 2

We can see that only one paper (49) has coordinates following this pattern, which we can extract.

colon_pattern <- colon_pattern |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest_longer("value") |>
  as.matrix() |>
  data.frame() |>
  select(paper_id = name, extract = value.1, degree = value.3, minutes = value.4, point = value.2) |>
  mutate(pattern = "colon")

colon_pattern

  paper_id   extract degree minutes point pattern
1       49 N54:41:18     54      41     N   colon
2       49  W2:22:45      2      22     W   colon
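The table keeps degrees and minutes and drops the seconds captured by the fourth group. As a rough sketch of the standard conversion to decimal degrees (degrees + minutes/60 + seconds/3600, negative for S and W), the helper below is hypothetical and is not the function used later in this document.

```r
## Hypothetical helper: degrees, minutes, seconds and compass point -> decimal degrees
dms_to_decimal <- function(degree, minutes, seconds = 0, point) {
  value <- as.numeric(degree) + as.numeric(minutes) / 60 + as.numeric(seconds) / 3600
  ifelse(point %in% c("S", "W"), -value, value)
}

lat  <- dms_to_decimal("54", "41", "18", "N")
long <- dms_to_decimal("2", "22", "45", "W")
lat    # 54.68833
long   # -2.379167
```

Ignoring the seconds, as the combined table later does, shifts these values by under 0.02 of a degree.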

Decimal pattern

After a lot of trial and error, a number of other patterns can be worked out.

Decimal - "\\d{1,2}\\.\\d{1,2}(◦|°)([NSEW])"

decimal_pattern <- "(\\d{1,2})(\\.)(\\d{1,2}).?(◦|°).?([NSEW])"

dec_pattern <- str_match_all(y$text, decimal_pattern) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest_longer("value") |>
  as.matrix() |>
  data.frame() |>
  select(paper_id = name, extract = value.1, degree = value.2, minutes = value.4, point = value.6) |>
  mutate(pattern = "decimal")
  
dec_pattern |>
  gt::gt()

paper_id	extract	degree	minutes	point	pattern
17	54.18◦ N	54	18	N	decimal
17	2.36◦ E	2	36	E	decimal
26	52.30◦ N	52	30	N	decimal
26	6.40◦ W	6	40	W	decimal
55	10.17 °W	10	17	W	decimal
55	10.12 °W	10	12	W	decimal
55	10.17 °W	10	17	W	decimal
55	9.59 °W	9	59	W	decimal
55	9.55 °W	9	55	W	decimal
55	9.40 °W	9	40	W	decimal
55	9.43 °W	9	43	W	decimal
55	9.38 °W	9	38	W	decimal
55	51.47 °N	51	47	N	decimal
55	51.36 °N	51	36	N	decimal
55	51.58 °N	51	58	N	decimal
55	51.58 °N	51	58	N	decimal
55	51.44 °N	51	44	N	decimal
55	51.37 °N	51	37	N	decimal
55	51.35 °N	51	35	N	decimal
55	51.35 °N	51	35	N	decimal
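Note that for this pattern the column labelled minutes holds the digits after the decimal point, not true minutes. One way to turn the two captured parts back into a number is to paste them together, sketched below with a hypothetical helper. It avoids assuming that there are always exactly two decimal digits (the pattern allows one or two).

```r
## Hypothetical helper: rebuild a decimal degree from the two captured digit groups
decimal_from_parts <- function(whole, fraction, point) {
  value <- as.numeric(paste0(whole, ".", fraction))
  ifelse(point %in% c("S", "W"), -value, value)
}

two_digits <- decimal_from_parts("54", "18", "N")
one_digit  <- decimal_from_parts("9", "4", "W")
two_digits   # 54.18
one_digit    # -9.4

## Dividing the fraction by 100 agrees only when there are exactly two decimal digits
by_hundred <- 9 + as.numeric("4") / 100
by_hundred   # 9.04
```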

Polar pattern

This is more complex because of the variation in the way polar coordinates are represented in articles.

To simplify things we can try and match lats and longs separately. A pattern that seems to work is

"(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([NS])"

This means: find a text string with one or two digits, then an optional degree symbol, then an optional space, then one or two digits, then any number of non-digit characters, then N or S.

polar_pattern_lat <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([NS])"
polar_pattern_long <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([EW])"

## Extract latitudes
polar_lats <- str_match_all(y$text, polar_pattern_lat) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame()

## Extract longitudes
polar_longs <- str_match_all(y$text, polar_pattern_long) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame()

## Combine and select and rename fields
polar_coords <- polar_lats |>
  bind_rows(polar_longs) |>
  arrange(name) |>
  select(paper_id = name, extract = value.1, degree = value.2, minutes = value.5, point = value.7) |>
  mutate(pattern = "polar")

polar_coords |>
  gt::gt()

paper_id	extract	degree	minutes	point	pattern
5	52°07 N	52	07	N	polar
5	08°16 W	08	16	W	polar
6	52° 8′ N	52	8	N	polar
6	54° 20′ N	54	20	N	polar
6	8° 19′ W	8	19	W	polar
18	56◦ 29 N	56	29	N	polar
18	9◦ 34 E	9	34	E	polar
23	53°13′N	53	13	N	polar
23	4°0ʹW; Fig. S	4	0	S	polar
23	4°0ʹW	4	0	W	polar
24	52°18′N	52	18	N	polar
24	6°30′W	6	30	W	polar
25	50°45′N	50	45	N	polar
25	3°50′W	3	50	W	polar
30	52◦ 31’N	52	31	N	polar
30	0◦ 23’E	0	23	E	polar
35	53° 30’ N	53	30	N	polar
35	6° 10’ E	6	10	E	polar
38	55◦ 52’N	55	52	N	polar
38	03◦ 02’W	03	02	W	polar
42	51° 46'N	51	46	N	polar
42	9° 42'E	9	42	E	polar
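The row "4°0ʹW; Fig. S" for paper 23 is a false latitude: `\\D*` accepts any run of non-digits, so the match runs on past the W until it meets an S. The sketch below reproduces this on two strings taken from the table (with a "1" appended to the second, as in "Fig. S1", which is a guess at the original text) and tries a stricter variant. The limit of three characters is illustrative only and has not been tested against the corpus.

```r
library(stringr)

polar_pattern_lat <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D*)([NS])"

## Hypothetical stricter variant: at most 3 non-digits before the compass point
polar_lat_strict <- "(\\d{1,2})(◦|°|)(\\s?)(\\d{1,2})(\\D{0,3})([NS])"

txt <- c("54° 20′ N", "4°0ʹW; Fig. S1")

loose  <- str_extract(txt, polar_pattern_lat)
strict <- str_extract(txt, polar_lat_strict)

loose    # the second element is the false match "4°0ʹW; Fig. S"
strict   # the second element is NA
```

A tighter bound reduces false matches but may miss coordinates with longer separators, so it is a trade-off rather than a fix.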

More possible patterns

p1 <- "\\d{1,2}\\D*◦\\D?\\d{1,2}\\D*[NSEWnsew]|\\d{1,2}\\D*°\\D?\\d{1,2}\\D*[NSEW]" ## generic pattern

p2 <- "\\d{1,2}\\.\\d{1,2}.?◦.?[NSEWnsew]|\\d{1,2}\\.\\d{1,2}.?°.?[NSEWnsew]" ## decimal coordinates

p3 <- "[NSEWnsew]\\d{1,2}:\\d{1,2}:\\d{1,2}?" ## colon separated

p4 <- "\\d{4,6}\\D?[NSns].*\\d{4,6}\\D?[EWew]" ## easting / northing

p5 <- "(\\d{1,2})(◦|°|\\.||8?|:)\\s?(\\d{1,2}|\\d{1,2}′|\\d{1,2}\\00.)(\\s*)([NSEW])"

p6 <- "(\\d{1,2})(◦|°|\\.||8?|:|\\001)*[NSEW]"

p7 <- "(\\d{1,2})\\D{1,4}(\\d{1,2}0)\\D{1,3}(\\d{1,2}00)\\s*([NSEWnsew])"

Easting-northing

p4_lat <- "(\\d{4,6})\\D?([NS]).*(\\d{4,6})\\D?([EW])"    

e_n_pattern <- str_match_all(y$text, p4_lat) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame() 
  
## convert to decimal coordinates
enp <- e_n_pattern |>
  mutate(dec = map2(value.4, value.2, decimal_from_en)) |>
  unnest("dec")

enp

# A tibble: 5 × 8
  name  value.1                      value.2 value.3 value.4 value.5   lat  long
  <chr> <chr>                        <chr>   <chr>   <chr>   <chr>   <dbl> <dbl>
1 " 7"  "568570 N, 4835W"            568570  N       4835    W        54.9 -7.49
2 "16"  "458380 N, 28440 E"          458380  N       8440    E        53.9 -6.81
3 "31"  "1000 N, 3\u0001 540 0500 W" 1000    N       0500    W        49.9 -7.50
4 "32"  "528160 N, 88250 W"          528160  N       8250    W        54.5 -6.91
5 "54"  "8199N, 3u4490.8199W"        8199    N       8199    W        50.6 -6.49
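The function decimal_from_en() comes from the sourced script and is not shown here. For readers without that script, one generic way to convert grid eastings and northings to lat longs is to reproject them with sf. The sketch below is not the author's method. It assumes the values are Irish Grid references (EPSG:29902), which is a guess, and uses a made-up point at the grid's false origin.

```r
library(sf)

## Made-up grid reference at the Irish Grid false origin
en <- data.frame(easting = 200000, northing = 250000)

## EPSG:29902 (TM65 / Irish Grid) is an assumption; use the CRS the paper actually reports
pts <- st_as_sf(en, coords = c("easting", "northing"), crs = 29902)

## Reproject to WGS 84 and pull out the coordinates
ll <- st_coordinates(st_transform(pts, 4326))
ll   # column X is longitude, column Y is latitude
```

Whatever the conversion, the extracts in the table above (for example "8199N, 3u4490.8199W") suggest that several of these matches are not real grid references, so the results need checking against the papers.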

Putting it all together

We can combine the outputs of each stage into a final dataset, do some final cleaning and calculate decimal coordinates for use in mapping the locations.

enp1 <- enp |>
  select(paper_id = name, lat, long)

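combined <- bind_rows(colon_pattern, dec_pattern, polar_coords) |>
  ## decimal extracts: fractional digits / 100 (exact only for two decimal places);
  ## all other patterns: degrees + minutes / 60 (seconds are ignored)
  mutate(dec_coords = ifelse(pattern == "decimal", as.numeric(degree) + as.numeric(minutes) / 100, as.numeric(degree) + as.numeric(minutes) / 60), 
         ## west is made negative; note that south is not negated here
         dec_coords = ifelse(point == "W", dec_coords * -1, dec_coords), 
         lat_lon = case_when(point %in% c("N", "S") ~ "lat", 
                             TRUE ~ "long")) |>
  select(paper_id, dec_coords, lat_lon) |>
  pivot_wider(names_from = "lat_lon", values_from = "dec_coords") |>
  arrange(paper_id) |>
  unnest("lat") |>
  unnest("long") |>
  distinct() |>
  ## drops the spurious pair for paper 23 (lat 4, long -4) created by the false "S" match
  filter(lat != -long) 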

combined <- combined |>
  bind_rows(enp1) |>
  arrange(paper_id) 

combined |>
  gt::gt()

paper_id	lat	long
5	52.11667	-8.2666667
6	52.13333	-8.3166667
6	54.33333	-8.3166667
7	54.88772	-7.4861446
16	53.92507	-6.8091290
17	54.18000	2.3600000
18	56.48333	9.5666667
23	53.21667	-4.0000000
24	52.30000	-6.5000000
25	50.75000	-3.8333333
26	52.30000	-6.4000000
30	52.51667	0.3833333
31	49.85964	-7.4982162
32	54.54877	-6.9117863
35	53.50000	6.1666667
38	55.86667	-3.0333333
42	51.76667	9.7000000
49	54.68333	-2.3666667
54	50.55094	-6.4909419
55	51.47000	-10.1700000
55	51.47000	-10.1200000
55	51.47000	-9.5900000
55	51.47000	-9.5500000
55	51.47000	-9.4000000
55	51.47000	-9.4300000
55	51.47000	-9.3800000
55	51.36000	-10.1700000
55	51.36000	-10.1200000
55	51.36000	-9.5900000
55	51.36000	-9.5500000
55	51.36000	-9.4000000
55	51.36000	-9.4300000
55	51.36000	-9.3800000
55	51.58000	-10.1700000
55	51.58000	-10.1200000
55	51.58000	-9.5900000
55	51.58000	-9.5500000
55	51.58000	-9.4000000
55	51.58000	-9.4300000
55	51.58000	-9.3800000
55	51.44000	-10.1700000
55	51.44000	-10.1200000
55	51.44000	-9.5900000
55	51.44000	-9.5500000
55	51.44000	-9.4000000
55	51.44000	-9.4300000
55	51.44000	-9.3800000
55	51.37000	-10.1700000
55	51.37000	-10.1200000
55	51.37000	-9.5900000
55	51.37000	-9.5500000
55	51.37000	-9.4000000
55	51.37000	-9.4300000
55	51.37000	-9.3800000
55	51.35000	-10.1700000
55	51.35000	-10.1200000
55	51.35000	-9.5900000
55	51.35000	-9.5500000
55	51.35000	-9.4000000
55	51.35000	-9.4300000
55	51.35000	-9.3800000
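The 42 rows for paper 55 are every combination of its six distinct latitudes and seven distinct longitudes, because unnesting the two list columns in turn crosses them. The extracts list eight longitudes and eight latitudes, which suggests eight sites. If the two lists are reported in matching order (an assumption that would need checking against the paper), they can be paired by position instead. The sketch uses the first two longitudes and latitudes from paper 55, and the `pair` column is a name introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

## First two longitudes and latitudes extracted from paper 55
coords <- tibble(
  paper_id   = c("55", "55", "55", "55"),
  lat_lon    = c("long", "long", "lat", "lat"),
  dec_coords = c(-10.17, -10.12, 51.47, 51.36)
)

## Number each lat and each long within a paper, then widen on that index
paired <- coords |>
  group_by(paper_id, lat_lon) |>
  mutate(pair = row_number()) |>
  ungroup() |>
  pivot_wider(names_from = lat_lon, values_from = dec_coords)

paired   # two rows: (51.47, -10.17) and (51.36, -10.12)
```

Pairing by position gives one row per site rather than one row per combination, but it is only valid when a paper lists its latitudes and longitudes in the same order.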

Mapping reported locations

Mapping decimal coordinates is relatively straightforward with the sf and mapview packages.

First we need to convert our lat longs into sf format

library(sf); library(mapview)

combined_sf <- st_as_sf(combined, coords = c("long", "lat"), crs = 4326)

## save as a csv file for wider use

#st_write(combined_sf, "combined_shp.csv")

combined_sf

Simple feature collection with 61 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -10.17 ymin: 49.85964 xmax: 9.7 ymax: 56.48333
Geodetic CRS:  WGS 84
# A tibble: 61 × 2
   paper_id             geometry
 * <chr>             <POINT [°]>
 1 " 5"     (-8.266667 52.11667)
 2 " 6"     (-8.316667 52.13333)
 3 " 6"     (-8.316667 54.33333)
 4 " 7"     (-7.486145 54.88772)
 5 "16"     (-6.809129 53.92507)
 6 "17"             (2.36 54.18)
 7 "18"      (9.566667 56.48333)
 8 "23"            (-4 53.21667)
 9 "24"              (-6.5 52.3)
10 "25"        (-3.833333 50.75)
# … with 51 more rows

We can then pass this object to mapview, which gives an interactive map.

mapview(combined_sf)

It is clear there are some issues:

- 3 points are in implausible locations (papers 17, 31 and 54)
- Some points may be missing

The anomalous papers, and any that appear to be missing, can then be searched to check the reported coordinates.
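One way to search a flagged paper is to pull out each match together with some of the surrounding text. The helper below is a hypothetical sketch, and the sentence is made up to stand in for a paper's text.

```r
library(stringr)

## Hypothetical helper: each match plus up to `width` characters either side
show_context <- function(text, pattern, width = 30) {
  padded <- paste0(".{0,", width, "}", pattern, ".{0,", width, "}")
  str_extract_all(text, padded)[[1]]
}

## Made-up sentence standing in for the text of one paper
txt <- "Samples were taken offshore (54.18◦ N, 2.36◦ E) in spring."
decimal_pattern <- "(\\d{1,2})(\\.)(\\d{1,2}).?(◦|°).?([NSEW])"

ctx <- show_context(txt, decimal_pattern)
ctx
```

With the real data the first argument would be one element of y$text, for example y$text[17]. Because "." does not match a line break by default, the context stops at the end of the line containing the match.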

Elevations

There are text patterns for elevations to which we can apply similar logic: generally a number (1-5 digits), possibly followed by m, followed by “elevation”, “a.s.l” or “above sea level”.

el_pattern <- "(elevation of|a.s.l|above sea level)?\\s?(\\d{1,})\\s([m]?)\\s(elevation|a.s.l|above sea level)"

elevations <- str_match_all(y$text, el_pattern) |>
  enframe() |>
  mutate(dim = map(value, nrow)) |>
  filter(dim > 0) |>
  select(-dim) |>
  unnest("value") |>
  as.matrix() |>
  as.data.frame() |>
  select(paper_id = name, elevation_in_m = value.3 )

elevations |>
  gt::gt()

paper_id	elevation_in_m
5	52
11	900
15	1
16	1040
21	40
23	270
26	67
38	190
40	190
41	340
41	160
41	200
44	60
54	140
55	187
58	56
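Two details of el_pattern are worth testing on short strings: it requires whitespace between the number and m, and the dots in a.s.l are unescaped, so they match any character. The variant below is a hypothetical alternative, not the pattern used to produce the table above, and the test strings are made up.

```r
library(stringr)

el_pattern <- "(elevation of|a.s.l|above sea level)?\\s?(\\d{1,})\\s([m]?)\\s(elevation|a.s.l|above sea level)"

## Hypothetical variant: 1-5 digits, optional spaces around m, literal dots in a.s.l
el_variant <- "(\\d{1,5})\\s?m?\\s?(elevation|a\\.s\\.l|above sea level)"

samples <- c("at 270 m a.s.l.", "at 270m a.s.l.", "190 m above sea level")

orig_hits    <- str_detect(samples, el_pattern)
variant_hits <- str_detect(samples, el_variant)
orig_hits      # TRUE FALSE TRUE
variant_hits   # TRUE TRUE TRUE

## The first capture group of the variant is the elevation
variant_values <- str_match(samples, el_variant)[, 2]
variant_values # "270" "270" "190"
```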

Add this to locations

combined_sf |>
  left_join(elevations, by = "paper_id")

Simple feature collection with 61 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -10.17 ymin: 49.85964 xmax: 9.7 ymax: 56.48333
Geodetic CRS:  WGS 84
# A tibble: 61 × 3
   paper_id             geometry elevation_in_m
   <chr>             <POINT [°]> <chr>         
 1 " 5"     (-8.266667 52.11667) 52            
 2 " 6"     (-8.316667 52.13333) <NA>          
 3 " 6"     (-8.316667 54.33333) <NA>          
 4 " 7"     (-7.486145 54.88772) <NA>          
 5 "16"     (-6.809129 53.92507) 1040          
 6 "17"             (2.36 54.18) <NA>          
 7 "18"      (9.566667 56.48333) <NA>          
 8 "23"            (-4 53.21667) 270           
 9 "24"              (-6.5 52.3) <NA>          
10 "25"        (-3.833333 50.75) <NA>          
# … with 51 more rows
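The padded ids such as " 5" appear to come from the as.matrix() step used in the pipelines above: when a data frame mixes numeric and character columns, as.matrix() formats the numbers as strings of equal width. The join works here because both tables were padded to the same width. A small sketch with made-up rows shows the effect and one way to undo it.

```r
## as.matrix() on a mixed data frame formats numeric columns as padded strings
df  <- data.frame(name = c(5L, 16L), value = c("52°07 N", "458380 N"))
ids <- unname(as.matrix(df)[, "name"])
ids         # " 5" "16"

## Trimming and converting back gives ids that sort and join as numbers
clean_ids <- as.integer(trimws(ids))
clean_ids   # 5 16
```

Converting paper_id back to an integer before joining would make the join independent of how wide the padding happens to be in each table.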