Trajectory similarity calculation using Dynamic Time Warping

Calculating simmilarity between spatial tracks is a rather complicated task, since trajectories are located in both geographical space and time and can have a number of attributes attached to them. Moreover, attributes can be attached to trajectory as a whole (name of pilot), to individual vetrexes (altitude of plane) or to links (average speed between two recorded locations).

This text will show, how similarity between trajectories in geographical space can be calculated using Dynamic Time Warping procedure.

Sample data can be downloaded from github.

library(dtw)  # dynamic time warping
library(sp)  # handle spatial objects
library(rgeos)  # geoprocessing functions

# read tracks as data frame
tracks <- read.delim("track01.dat", header = T, sep = "\t", dec = ",")

# view data
str(tracks)

## 'data.frame':    13579 obs. of  10 variables:
##  $ INDEX       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TRACK.NUMBER: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LOCAL.DATE  : Factor w/ 16 levels "10.4.2012","11.4.2012",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ LOCAL.TIME  : Factor w/ 7044 levels "19:18:23","19:18:24",..: 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 ...
##  $ LATITUDE    : num  49.2 49.2 49.2 49.2 49.2 ...
##  $ N.S         : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LONGITUDE   : num  16.5 16.5 16.5 16.5 16.5 ...
##  $ E.W         : Factor w/ 2 levels "","E": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ALTITUDE    : num  227 227 228 228 230 ...
##  $ SPEED       : num  37.6 41.8 43.2 42.6 44 ...

Data frame tracks consists of 16 GPS tracks of cars, driving approximately the same route. Every track consists of about 850 points with various attributes, including date, time, position and speed.

# convert to spatial object in WGS
geo_tracks <- tracks
coordinates(geo_tracks) <- ~LONGITUDE + LATITUDE

# convert to SpatialLines object
track_ids <- unique(geo_tracks$TRACK.NUMBER)

sl <- lapply(track_ids, FUN = function(x) {
    crds <- coordinates(subset(geo_tracks, TRACK.NUMBER == x))
    return(Lines(slinelist = list(Line(coords = crds)), ID = x))
})

sl <- SpatialLines(sl)

# all tracks follow the same route
plot(sl)

plot of chunk geo.transform

By transformation to SpatialLines, attribute values for individual vertexes were discarded. Dynamic Time Warping is, however, independent on number of dimensions. Following code shows, that the distance between two tracks can be very well calculated using number of attributes.

# take first two tracks - longitude, latitude, altitude and speed of cars
track1 <- data.frame(lon = tracks$LONGITUDE[tracks$TRACK.NUMBER == 1], lat = tracks$LATITUDE[tracks$TRACK.NUMBER == 
    1], alt = tracks$ALTITUDE[tracks$TRACK.NUMBER == 1], speed = tracks$SPEED[tracks$TRACK.NUMBER == 
    1])

track2 <- data.frame(lon = tracks$LONGITUDE[tracks$TRACK.NUMBER == 2], lat = tracks$LATITUDE[tracks$TRACK.NUMBER == 
    2], alt = tracks$ALTITUDE[tracks$TRACK.NUMBER == 2], speed = tracks$SPEED[tracks$TRACK.NUMBER == 
    2])

# calculate distance in all 4 dimensions
align <- dtw(track1, track2)
align$distance

## [1] 8800

# calculate distance only in geographical space
align <- dtw(track1[, 1:2], track2[, 1:2])
align$distance

## [1] 0.1001

# calculate distance only between speeds
align <- dtw(track1$speed, track2$speed)
align$distance

## [1] 2582

Calculating distance in both attribute and geographical space is, however, not advisable. Every attribute has different units and from the example above it is clear, that variable speed has far greater influence on calculated distance than geographical coordinates.

First, define function to calculate distances between two sets of spatial lines (or between all pair of one set).

# function to calculate dwt distance between spatial lines
dtwDistance <- function(spgeom1, spgeom2 = NULL) {
    # if second set of lines is not given, calculate pairwise distances within
    # first set of lines
    if (is.null(spgeom2)) {

        # prepare empty distance matrix
        n_geoms <- length(spgeom1)
        distmat <- matrix(rep(0, n_geoms^2), ncol = n_geoms)

        # fill the matrix
        for (i in 1:(n_geoms - 1)) {
            crds1 <- coordinates(spgeom1[i, ])[[1]][[1]]
            for (j in (i + 1):n_geoms) {

                crds2 <- coordinates(spgeom1[j, ])[[1]][[1]]
                align <- dtw(crds1, crds2)
                distmat[i, j] <- distmat[j, i] <- align$normalizedDistance  # normalized distance

            }
        }

        # if two sets of lines are given, calculate pairwise distances
    } else {

        # prepare empty distance matrix
        n_geoms1 <- length(spgeom1)
        n_geoms2 <- length(spgeom2)
        distmat <- matrix(rep(0, n_geoms1 * n_geoms2), nrow = n_geoms1)

        # fill the matrix
        for (i in 1:n_geoms1) {
            crds1 <- coordinates(spgeom1[i, ])[[1]][[1]]
            for (j in 1:n_geoms2) {

                crds2 <- coordinates(spgeom2[j, ])[[1]][[1]]
                align <- dtw(crds1, crds2)
                distmat[i, j] <- align$normalizedDistance

            }
        }

    }
    return(distmat)
}

Hierarchical clustering clearly shows three different sets of tracks

# calculate distance matrix and perform clustering
dm <- dtwDistance(sl)

fit <- hclust(as.dist(dm), method = "single")
plot(fit, xlab = "track number")
rect.hclust(fit, k = 3, border = "red")

plot of chunk clustering

Track 15, which is most different from the others, is much shorter than the rest of tracks. Its start and end points are highlighted on the figure below.

# plot - highlight track 15, which is most different
sort(gLength(sl, byid = T))  # track 15 is much shorter

##     15      1      2      3      9     16     11      8     12     10 
## 0.1585 0.1709 0.1710 0.1713 0.1749 0.1750 0.1752 0.1753 0.1754 0.1755 
##      7     14      4      6      5     13 
## 0.1756 0.1757 0.1757 0.1757 0.1758 0.1758

plot(sl, col = cutree(fit, 3), axes = TRUE)
crds_cl3 <- coordinates(sl[15, ])[[1]][[1]]
points(crds_cl3[c(1, nrow(crds_cl3)), ], pch = 16, cex = 1.6)  # highlight start and end points

plot of chunk track15

The difference between clusters 1 and 2 consists in small detour taken by cars in cluster 2, which is highlighted in figure below.

# zoom to area, where is difference between cluster 1 and 2
plot(sl, col = cutree(fit, 3), axes = TRUE, xlim = c(16.52, 16.53), ylim = c(49.29, 
    49.3))
legend("topright", legend = c("cluster 1", "cluster 2", "cluster 3"), col = c(1, 
    2, 3), lty = 1)

plot of chunk zoom

Overall, Dynamic Time Warping was very succesfull in sorting available tracks info three distinct clusters, whose differences could be easily identified and interpreted.