Fast spatial data matching in R
How to match locations based on their coordinates
This document was prompted by this reddit question on how to fill in missing location names by matching them to other known locations based on geographical proximity (using lat/long coordinates). The question stems from a case study in Google’s Data Analytics certificate.
To answer the question in a nutshell: the proposed position-based matching won’t yield anything with this dataset, since the locations missing a name or id lack them precisely because their coordinates are not precise enough to be reliably matched to another station by proximity alone.
Please visit this page for a more up-to-date version of this post.
You can check the source code by clicking on the </> Code button at the top-right.
1 Setup
library(here) # File path management
library(fs) # File & folder manipulation
library(pipebind) # Piping goodies
library(data.table) # Fast data manipulation
library(dplyr) # Slower (but more readable) data manipulation
library(dtplyr) # data.table backend for dplyr
library(tidyr) # Extra convenience functions for data manipulation
library(DBI) # Database connection
library(dbplyr) # SQL back-end for dplyr
library(duckdb) # Quack Stack
library(stringr) # Manipulating strings
library(purrr) # Manipulating lists
library(fuzzyjoin) # Non-equi joins & coordinates-based joins
options(
dplyr.strict_sql = FALSE,
scipen = 999L,
digits = 4L,
knitr.max_rows_print = 10
)
data.table::setDTthreads(parallel::detectCores(logical = TRUE))

─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 20.04.4 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Europe/Paris
date 2022-09-24
pandoc 2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
Quarto 1.1.251
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P data.table * 1.14.3 2022-07-27 [?] Github (Rdatatable/data.table@c4a2085)
P DBI * 1.1.3 2022-06-18 [?] CRAN (R 4.2.0)
P dbplyr * 2.2.1 2022-06-27 [?] CRAN (R 4.2.0)
dplyr * 1.0.99.9000 2022-08-15 [1] Github (tidyverse/dplyr@d8294b4)
P dtplyr * 1.2.1 2022-01-19 [?] CRAN (R 4.2.0)
duckdb * 0.5.0 2022-09-03 [1] https://duckdb.r-universe.dev (R 4.2.1)
fs * 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
fuzzyjoin * 0.1.6 2020-05-15 [1] CRAN (R 4.2.0)
P here * 1.0.1 2020-12-13 [?] CRAN (R 4.2.0)
pipebind * 0.1.1 2022-08-10 [1] CRAN (R 4.2.0)
P purrr * 0.3.4 2020-04-17 [?] CRAN (R 4.2.0)
P stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
P tidyr * 1.2.0 2022-02-01 [?] CRAN (R 4.2.0)
[1] /home/mar/Dev/Projects/R/Misc/renv/library/R-4.2/x86_64-pc-linux-gnu
[2] /home/mar/.cache/R/renv/library/Misc-f25fd835/R-4.2/x86_64-pc-linux-gnu
[3] /usr/lib/R/library
[4] /usr/local/lib/R/site-library
[5] /usr/lib/R/site-library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
2 Loading the data
2.1 Wide format (original)
rides <- (purrr::map_dfr(
    fs::dir_ls(data_path, glob = "*.csv"), # `data_path` points to the folder holding the raw CSV files
    \(file) fread(file, na.strings = "")
  )
)
setkey(rides, ride_id)

Time difference of 17.34 secs
data.frame [5,860,776 x 13]
| [ omitted 5,860,761 entries ] |
2.2 Long format
rides_l <- melt(
rides,
measure = measure(way, value.name, pattern = "(end|start).*(name|id|lat|lng)")
)
setkey(rides_l, ride_id, id)

Time difference of 6.138 secs
rides_l.dtp <- (pivot_longer(
rides,
matches("^end_|^start_"),
names_pattern = "(end|start).*(name|id|lat|lng)",
names_to = c("way", ".value")
)
|> as.data.table()
)
setkey(rides_l.dtp, ride_id, id)

Time difference of 10.41 secs
CREATE TABLE rides_l AS
(
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual
, 'start' AS way
, start_station_name AS name
, start_station_id AS id
, start_lat AS lat
, start_lng AS lng
FROM rides
)
UNION ALL
(
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual
, 'end' AS way
, end_station_name AS name
, end_station_id AS id
, end_lat AS lat
, end_lng AS lng
FROM rides
);
CREATE INDEX station_idx ON rides_l (id);

No need to re-define an index for ride_id here: it was passed down from the rides table.
Time difference of 6.508 secs
We can reuse an existing data frame and directly add it (or bind it as a view) to the database:
DBI::dbWriteTable(rides_con, "rides_l", rides_l) # As a TABLE
duckdb::duckdb_register(rides_con, "rides_l", rides_l) # As a VIEW

data.frame [11,721,552 x 10]
| [ omitted 11,721,537 entries ] |
3 Exploring the data
3.1 Cleaning useless coordinates
Removing entries where lat/lng do not have sufficient precision to be reliably matched to a station (i.e. entries with fewer than 4 decimals, which corresponds to an 11 meter “radius” at the equator).
| Decimal | Distance at the equator (m) |
|---|---|
| 0 | 111,120 |
| 1 | 11,112 |
| 2 | 1,111.2 |
| 3 | 111.12 |
| 4 | 11.112 |
| 5 | 1.1112 |
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 4

CREATE FUNCTION decp(x) AS length(str_split(CAST(abs(x) AS VARCHAR(10)), '.')[2]) >= 4

rides_l_clean <- rides_l[decp(lat) & decp(lng), ]
setkey(rides_l_clean, ride_id)

Time difference of 25.32 secs
rides_l_clean.dtp <- (rides_l
|> filter(decp(lat) & decp(lng))
|> as.data.table()
)
setkey(rides_l_clean.dtp, ride_id)

Time difference of 26.81 secs
CREATE TABLE rides_l_clean AS
SELECT * FROM rides_l
WHERE decp(lat) AND decp(lng)

Time difference of 5.369 secs
Here, dbplyr leaves the decp call as-is in the SQL translation, but since we previously defined a decp SQL function, that function gets called when the SQL query is executed.
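For reference, a minimal sketch of what that dbplyr pipeline could look like (an assumption; rides_con is the DuckDB connection used above, and compute() materializes the result as a new table):

rides_l_clean.db <- (tbl(rides_con, "rides_l")
  |> filter(decp(lat) & decp(lng)) # decp() is kept verbatim in the SQL and resolved by the UDF at run time
  |> compute(name = "rides_l_clean")
)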
Time difference of 16.91 secs
data.table [9,937,077 x 10]
| [ omitted 9,937,062 entries ] |
3.2 What’s missing?
Entries missing one or both coordinates but having an id or name:
SELECT * FROM rides_l_clean
WHERE ((lat IS NULL) OR (lng IS NULL))
AND (NOT((id IS NULL)) OR NOT((name IS NULL)))

data.frame [0 x 10]
Entries missing either name or id, but having coordinates:
CREATE TABLE rides_l_clean_unk AS
SELECT * FROM rides_l_clean
WHERE ((id IS NULL) OR (name IS NULL))
AND (NOT((lat IS NULL)) AND NOT((lng IS NULL)))

data.frame [3 x 10]
It seems there are only 3 entries missing identification that could be matched based on their coordinates at the level of precision we use (11m / 4 decimals).
However, if we look at the original dataset (before filtering out the inaccurate coordinates):
CREATE TABLE rides_l_unk AS
SELECT * FROM rides_l
WHERE ((id IS NULL) OR (name IS NULL))
AND (NOT((lat IS NULL)) AND NOT((lng IS NULL)))

data.frame [1,696,469 x 10]
| [ omitted 1,696,454 entries ] |
There were 1,696,469 entries missing a station id or name that could have been filled in the original data, but it seems all of them disappeared when we filtered out the coordinates with less than 4 decimals of precision. It would seem those entries were missing their id/name in the source data precisely because their coordinates were too imprecise to be matched to any station in the first place.
4 Stations data
4.1 Creating stations data
First, we need to assemble a dataset linking each unique station id (and name) to a set of coordinates (here, we use the average lat & lng):
stations_clean <- (rides_l_clean
  |> na.omit(cols = c("id", "name"))
  |> dcast(id + name ~ ., fun.aggregate = list(min, max, mean), value.var = c("lat", "lng"))
  |> pipebind::bind(x, setcolorder(x, c("id", "name", str_subset(names(x), "lat_|lng_"))))
  |> unique(by = "id")
)
setkey(stations_clean, id)

CREATE TABLE stations_clean AS
SELECT DISTINCT on(id)
id, name
, MIN(lat) AS lat_min
, MAX(lat) AS lat_max
, AVG(lat) AS lat_mean
, MIN(lng) AS lng_min
, MAX(lng) AS lng_max
, AVG(lng) AS lng_mean
FROM rides_l_clean
WHERE (NOT((id IS NULL))) AND (NOT((name IS NULL)))
GROUP BY id, name;

data.frame [693 x 8]
| [ omitted 678 entries ] |
5 Matching missing id by position
Let’s match the entries of rides_l_clean with stations_clean by proximity:
5.1 Matching on the cleaned data
To save time, let’s only apply the procedure to the entries that actually need to be matched (i.e. the ones having coordinates but missing either name or id).
There are 3 entries from rides_l_clean that could be position-matched to a known station.
matched_clean <- (fuzzyjoin::geo_inner_join(
as.data.frame(rides_l_clean_unk),
as.data.frame(stations_clean),
by = c("lng" = "lng_mean", "lat" = "lat_mean"),
method = "haversine",
unit = "km",
max_dist = 0.011, # 11 meters
distance_col = "dist"
)
|> mutate(name = coalesce(name.x, name.y), id = coalesce(id.x, id.y))
|> select(all_of(names(rides_l_clean)), dist)
|> arrange(ride_id)
|> drop_na(ride_id, id, name)
|> setDT()
)
setkey(matched_clean, ride_id, id)

Time difference of 0.4161 secs
Creating the haversine distance function:
CREATE FUNCTION haversine(lat1, lng1, lat2, lng2)
AS ( 6371 * acos( cos( radians(lat1) ) *
cos( radians(lat2) ) * cos( radians(lng2) - radians(lng1) ) +
sin( radians(lat1) ) * sin( radians(lat2) ) )
);
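As an aside, here is a hedged R transcription of the same formula (strictly speaking, the spherical law of cosines, with a 6371 km Earth radius), handy to sanity-check the distances returned by the SQL function:

# R equivalent of the haversine() SQL UDF above (a sketch, not used in the pipeline)
haversine_km <- \(lat1, lng1, lat2, lng2) {
  rad <- pi / 180 # degrees to radians
  6371 * acos(
    pmin(1, cos(lat1 * rad) * cos(lat2 * rad) * cos((lng2 - lng1) * rad) +
      sin(lat1 * rad) * sin(lat2 * rad))
  )
}

haversine_km(41.88, -87.62, 41.89, -87.62) # ~1.11 km for 0.01° of latitude

Doing the matching: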
CREATE TABLE matched_clean AS
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual, way
, COALESCE(r.name, s.name) AS name
, COALESCE(r.id, s.id) AS id
, r.lat, r.lng
, haversine(s.lat_mean, s.lng_mean, r.lat, r.lng) AS dist
FROM rides_l_clean_unk r, stations_clean s
WHERE dist <= 0.011

Time difference of 0.004525 secs
data.table [3 x 11]
And we indeed get three matches!
But those three already had an id, so we could probably have filled in their missing name using stations_clean directly, instead of resorting to a convoluted proximity-based matching (which is more resource-intensive and less precise).
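For illustration, a minimal sketch of that simpler alternative, written as a data.table update join (filling name from stations_clean wherever it is missing):

rides_l_clean_unk[stations_clean, on = .(id), name := fcoalesce(name, i.name)]

Comparing the proximity-matched names with the ones recorded in stations_clean: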
stations_clean[matched_clean, on = .(id)
][, .(id, name.stations = name, name.proximity = i.name, lat, lng)]

data.table [3 x 5]
At least, we can see that the proximity-matched name and the one associated with that station in stations_clean are the same, so the proximity-matching method works reasonably well.
5.2 Matching on the original data
What if we applied the same procedure to the non-cleaned data (the one including coordinates less precise than our matching criteria)?
5.2.1 Unfiltered stations data
First, we need to recompute the stations data from rides_l (i.e. rides data before cleaning):
CREATE TABLE stations AS
SELECT DISTINCT on(id)
id, name
, MIN(lat) AS lat_min
, MAX(lat) AS lat_max
, AVG(lat) AS lat_mean
, MIN(lng) AS lng_min
, MAX(lng) AS lng_max
, AVG(lng) AS lng_mean
FROM rides_l
WHERE (NOT((id IS NULL))) AND (NOT((name IS NULL)))
GROUP BY id, name;

data.frame [1,078 x 8]
| [ omitted 1,063 entries ] |
Cleaning the results:
Notice that we get a lot more entries in stations: 1,078 entries vs 693 in the filtered version.
Which entries are in stations but not in stations_clean?
(stations_diff <- stations[!stations_clean, on = .(id, name)])

data.table [417 x 8]
| [ omitted 402 entries ] |
And which of those 417 entries have the necessary coordinate precision to be used later on to match against the unknown stations?
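The check itself isn’t shown, but it presumably boils down to applying the decp() helper to the averaged coordinates (a sketch):

stations_diff[decp(lat_mean) & decp(lng_mean)]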
data.table [0 x 8]
As expected, none. But we’re still going to do the matching, for posterity!
5.2.2 Position-matching on stations
To save time, let’s only apply the procedure to the entries that actually need to be matched (i.e. the ones having coordinates but missing either name or id):
There are 1,696,469 entries from rides_l that could be position-matched to a known station.
matched <- (fuzzyjoin::geo_inner_join(
as.data.frame(rides_l_unk),
as.data.frame(stations),
by = c("lng" = "lng_mean", "lat" = "lat_mean"),
method = "haversine",
unit = "km",
max_dist = 0.011, # 11 meters
distance_col = "dist"
)
|> mutate(name = coalesce(name.x, name.y), id = coalesce(id.x, id.y))
|> select(all_of(names(rides_l)), dist)
|> arrange(ride_id)
|> drop_na(ride_id, id, name)
|> setDT()
)
setkey(matched, ride_id, id)

Time difference of 2.143 secs
Doing the matching in SQL:
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual, way
, COALESCE(r.name, s.name) AS name
, COALESCE(r.id, s.id) AS id
, r.lat, r.lng
, haversine(s.lat_mean, s.lng_mean, r.lat, r.lng) AS dist
FROM rides_l_unk r, stations s
WHERE dist <= 0.011

Time difference of 9.838 secs
data.table [939,829 x 11]
| [ omitted 939,814 entries ] |
Notice how fast the procedure is, with close to 1 million matches (even if the results are mostly garbage).
What’s inside those matches?
matched[, .(`Number of matches for an entry` = .N), by = .(ride_id, way)
][, .(`Number of times it happens` = .N), by = `Number of matches for an entry`]

data.table [5 x 2]
We can see that more than half of the matches are coordinates that matched 2 or more stations, which we should definitely not keep.
But, among the ones with only one match, how many have coordinates precise enough to make that match in the first place (i.e. have 4 or more decimals of precision)?
matched[, if (.N == 1) .SD, by = .(ride_id, way)][decp(lat) & decp(lng)]

data.table [3 x 11]
As it turns out, only 3. And those are the same three matches we got from the filtered data.
In the end, those are the only three position-based matches we should reasonably keep!
6 Updating the original dataset
Finally, we need to update the original dataset (rides_l_clean) with the entries that were position-matched (matched_clean):
6.1 Merging the two datasets
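The data.table version (first timing below) presumably amounts to an update join along these lines; the exact code is an assumption, but the tables and columns are the ones defined above:

# Hedged sketch: patch the missing id/name of rides_l_clean from matched_clean
rides_l_merged <- copy(rides_l_clean)
rides_l_merged[matched_clean, on = .(ride_id, way),
  `:=`(id = fcoalesce(id, i.id), name = fcoalesce(name, i.name))]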
Time difference of 7.555 secs
Time difference of 9.505 secs
dplyr has the neat rows_* family of functions, which can easily replace or patch (i.e. only replace missing values in) the content of one dataset with matching rows from another, and it is quite fast!
Time difference of 3.587 secs
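A minimal sketch of what that could look like (an assumption; rows_patch() fills the NAs of the left-hand table with values from matching rows of the right-hand one, and dist is dropped since it has no counterpart in rides_l_clean):

rides_l_merged.dtp <- rows_patch(
  rides_l_clean,
  select(matched_clean, !dist), # drop the extra distance column before patching
  by = c("ride_id", "way")
)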
CREATE TABLE rides_l_merged AS
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual, way,
COALESCE(id_x, id_y) AS id,
COALESCE(name_x, name_y) AS name,
lat, lng
FROM (
SELECT
r.ride_id AS ride_id,
r.rideable_type AS rideable_type,
r.started_at AS started_at,
r.ended_at AS ended_at,
r.member_casual AS member_casual,
r.way AS way,
m.name AS name_x,
m.id AS id_x,
r.lat AS lat,
r.lng AS lng,
r.name AS name_y,
r.id AS id_y
FROM matched_clean AS m
RIGHT JOIN rides_l_clean AS r
ON m.ride_id = r.ride_id
AND m.rideable_type = r.rideable_type
AND m.started_at = r.started_at
AND m.ended_at = r.ended_at
AND m.member_casual = r.member_casual
AND m.way = r.way
);

Time difference of 4.306 secs
data.table [9,937,077 x 10]
| [ omitted 9,937,062 entries ] |
6.2 Validating the merge
We can see that the resulting dataset no longer has any entries with coordinates but a missing name or id, whereas there were three such entries before. We have successfully updated them!
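The validation can be as simple as counting the offending rows (a sketch of the check; it should return 0 on the merged data, versus 3 before the merge):

rides_l_merged[(is.na(id) | is.na(name)) & !is.na(lat) & !is.na(lng), .N]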
6.3 Pivoting back to the original (wide) format
To finish, let’s pivot the resulting data back into the wider format it was originally in:
rides_merged <- dcast(
rides_l_merged,
... ~ way,
value.var = c("name", "id", "lat", "lng"), sep = "_station_"
)

Time difference of 9.425 secs
rides_merged.dtp <- (rides_l_merged
|> pivot_wider(
names_from = "way",
values_from = c("name", "id", "lat", "lng"),
names_glue = "{way}_station_{.value}"
)
|> collect()
)

Time difference of 10.02 secs
CREATE TABLE rides_merged AS
SELECT
ride_id, rideable_type, started_at, ended_at, member_casual,
MAX(CASE WHEN (way = 'start') THEN name END) AS start_station_name,
MAX(CASE WHEN (way = 'end') THEN name END) AS end_station_name,
MAX(CASE WHEN (way = 'start') THEN id END) AS start_station_id,
MAX(CASE WHEN (way = 'end') THEN id END) AS end_station_id,
MAX(CASE WHEN (way = 'start') THEN lat END) AS start_lat,
MAX(CASE WHEN (way = 'end') THEN lat END) AS end_lat,
MAX(CASE WHEN (way = 'start') THEN lng END) AS start_lng,
MAX(CASE WHEN (way = 'end') THEN lng END) AS end_lng
FROM rides_l_merged
GROUP BY ride_id, rideable_type, started_at, ended_at, member_casual

Time difference of 6.463 secs
data.table [5,316,218 x 13]
| [ omitted 5,316,203 entries ] |
We get fewer than the original ~6 million entries of the wide format, due to having removed (filtered) the entries with bad coordinates.
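As a quick sanity check (a sketch), the difference corresponds to the rides for which both the start and end coordinates were filtered out:

nrow(rides) - nrow(rides_merged) # 5,860,776 - 5,316,218 = 544,558 rides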