Taxonomy of distance measures, by Roelofsen (2018)
suppressMessages(library(dplyr))
suppressMessages(library(TSdist))
data("example.database")
data("example.database2")
data("example.database3")
data("example.series1")
data("example.series2")
data("example.series3")
data("example.series4")
df <- t(example.database) %>%
as.data.frame()
df %>%
dplyr::mutate(t = 1:n()) %>%
tidyr::gather(key, value, -t) %>%
ggplot2::ggplot()+
ggplot2::geom_line(ggplot2::aes(x=t, y=value, color=key), show.legend = F)+
ggplot2::facet_wrap(.~key)

N <- length(example.database2$classes)
labels <- data.frame(key=1:N, label=example.database2$classes)
df <- t(example.database2$data) %>%
as.data.frame() %>%
`colnames<-`(1:N)
df %>%
dplyr::mutate(t = 1:n()) %>%
tidyr::gather(key, value, -t) %>%
merge(labels) %>%
ggplot2::ggplot()+
ggplot2::geom_line(ggplot2::aes(x=t, y=value, group=key, color=label), alpha=0.5, show.legend = F)+
ggplot2::facet_wrap(.~label)

- Shape-based distances
Compare the overall shape of time series based on the actual (scaled) values of the time series.
- Lock-step measures (\(n = m\))
These distance measures require both time series to be of equal length (\(n = m\)) and compare time point \(i\) of time series \(x\) with the same point \(i\) of time series \(y\).
- Minkowski distance
The \(L_{p}\)-norm of the difference between two vectors of equal length (\(n = m\)): \[d_{min} (x,y) = \bigg(\sum_{i=1}^{n} |x_i - y_i|^p \bigg)^{1/p}\] It is the generalization of the commonly used Euclidean distance (\(p=2\)), Manhattan distance (\(p=1\)) and Chebyshev distance (\(p= \infty\)). The time complexity of the Minkowski distance is \(O(n)\), and thus determining the distance matrix with this measure takes \(O(nN^2)\) time.
\[d_{euc} (x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\]
x <- example.series1
y <- example.series2
data.frame(x, y) %>%
dplyr::mutate(t = 1:n()) %>%
tidyr::gather(key, value, -t) %>%
ggplot2::ggplot()+
ggplot2::geom_line(ggplot2::aes(x=t, y=value, color=key))

TSdist::EuclideanDistance(x, y)
## [1] 33.79881
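Since the Minkowski distance generalizes these measures, a minimal base-R sketch (not a TSdist call) can reproduce the Euclidean value above and also give the Manhattan and Chebyshev distances for the same pair of series:
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1/p)   # L_p norm of the difference
minkowski(x, y, 1)    # Manhattan distance (p = 1)
minkowski(x, y, 2)    # Euclidean distance (p = 2), matches the value above
max(abs(x - y))       # Chebyshev distance (limit p -> Inf)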
diss <- TSdist::TSDatabaseDistances(example.database2$data, distance="euclidean", diag=T, upper=T) %>%
as.matrix()
diss %>%
reshape::melt() %>%
ggplot2::ggplot()+
ggplot2::geom_tile(ggplot2::aes(x = X1, y = X2, fill = value))+
ggplot2::scale_fill_viridis_c()

plot_diss_matrix <- function(diss, data){
  suppressMessages(library(plotly))
  diss[lower.tri(diss)] <- NA   # show only the upper triangle of the symmetric matrix
  plotly::plot_ly(hoverinfo = "none") %>%
    plotly::add_heatmap(z = diss, customdata = list(data)) %>%
    htmlwidgets::onRender(readLines("tx-inset-plot.js"))   # attach custom JS renderer (tx-inset-plot.js)
}
plot_diss_matrix(diss, example.database2$data)
\[d_{man} (x,y) = \sum_{i=1}^{n} |x_i - y_i|\] \[d_{che} (x,y) = \max_{i=1,\dots,n} |x_i - y_i|\]
- Pearson correlation distance
Takes into account the linear association between two vectors. The Pearson correlation coefficient is defined as: \[ \rho(x,y) = \frac{Cov(x,y)}{\sigma_x\sigma_y} = \frac{\mathbb{E}[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\] where \(\mu_x\) and \(\mu_y\) are the means of \(x\) and \(y\) and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\), respectively. The values of \(\rho\) lie within the range [-1,1], where \(\rho = 1\) indicates a perfect positive relationship between \(x\) and \(y\), \(\rho = -1\) indicates a perfect negative relationship between \(x\) and \(y\), and \(\rho = 0\) indicates no relationship between the two variables. The Pearson correlation distance is defined as: \[d_{cor}(x,y) = 1-\rho(x,y)\] This distance measure can take values in the range [0,2]. The time complexity is \(O(n)\) and thus determining the distance matrix with this measure takes \(O(nN^2)\) time.
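As a quick sketch, the Pearson correlation distance between the two example series can be computed directly with base R's cor():
# Pearson correlation distance d_cor = 1 - rho
1 - cor(example.series1, example.series2, method = "pearson")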
Alternative correlation measures include Spearman’s Rank and Kendall’s Tau correlation coefficients. These coefficients indicate correlation based on rank, whereas the Pearson coefficient is based on a linear relationship between two vectors. This makes Spearman’s Rank and Kendall’s Tau less sensitive to noise and outliers. However, this comes with an increase in time complexity: \(O(n\log(n))\) for Spearman’s Rank and \(O(n^2)\) for Kendall’s Tau.
Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. \[\rho_{spearman} = \frac{Cov(rg_x,rg_y)}{\sigma_{rg_x}\sigma_{rg_y}}\] where \(rg_x\) is the rank of the vector \(x\), which requires sorting the vector.
In the same way, the Kendall correlation between two variables is high when observations have a similar rank (i.e. relative position of the observations within the variable) in both variables. \[\tau = \frac{\text{concordant pairs} - \text{discordant pairs}}{\binom{n}{2}}\] A pair of observations \((x_i, y_i)\) and \((x_j, y_j)\) is said to be concordant if the sort order of the two elements agrees in both variables. In order to count the number of concordant and discordant pairs, all pairs of observations must be compared, hence the \(O(n^2)\) time complexity.
plot_coefficients <- function(x, y){
  # Scatter plot annotated with the Pearson, Spearman and Kendall coefficients
  plot(x, y)
  title(paste0(" Pearson: ", round(cor(x, y, method="pearson"), 2),
               "\nSpearman: ", round(cor(x, y, method="spearman"), 2),
               "\nKendall: ", round(cor(x, y, method="kendall"), 2)))
}
par(mfrow=c(1,3))
x <- runif(100)
plot_coefficients(x, x + rnorm(100, 0, 0.1))                 # linear relationship
plot_coefficients(x, 100*(x-0.5)^3 + x + rnorm(100, 0, 1))   # cubic, monotonic relationship
plot_coefficients(x, 10*x^2 + x + rnorm(100, 0, 0.1))        # quadratic relationship

par(mfrow=c(1,1))
Another alternative is a distance based on autocorrelation, which accounts for lags in time and has a time complexity of \(O(n\log(n))\).
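As a sketch of this idea (the exact definition varies between implementations, and the choice of 25 lags is an assumption), an autocorrelation-based distance can be computed as the Euclidean distance between the estimated autocorrelation vectors of the two series:
# Autocorrelation-based distance: Euclidean distance between the first
# max_lag autocorrelation coefficients of each series (definition assumed here)
acf_distance <- function(x, y, max_lag = 25){
  ax <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
  ay <- acf(y, lag.max = max_lag, plot = FALSE)$acf[-1]
  sqrt(sum((ax - ay)^2))
}
acf_distance(example.series1, example.series2)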
- Elastic measures (\(n \not= m\))
Elastic measures allow one-to-many and one-to-none matching; they are able to warp in time and are more robust in handling outliers. The main disadvantage, however, is that elastic distance measures generally come with an increase in time complexity.
DTW warps two sequences \(x\) and \(y\) non-linearly in time in order to cope with time deformations and varying speeds in time-dependent data. When determining the DTW distance between two time series, first an \((n \times m)\) local cost matrix (LCM) is calculated, where element \((i,j)\) contains the distance between \(x_i\) and \(y_j\). This distance is usually defined as the quadratic difference: \(d(x_i, y_j) = (x_i - y_j)^2\). Next, a warping path \(W = w_1, w_2, ..., w_K\) is determined, where \(\max(n,m) \le K \le m+n-1\). This path traverses the LCM under three constraints:
- Boundary condition: path must start and end in the diagonal corners \(w_1=(1,1)\), \(w_k=(n,m)\).
- Continuity: only adjacent elements in the matrix are allowed for steps in the path.
- Monotonicity: subsequent steps in the path must be monotonically spaced in time.
The total distance for path \(W\) is obtained by summing the individual elements (distances) of the LCM that the path traverses. To obtain the DTW distance, the path with the minimum total distance is required. This path can be obtained by an \(O(nm)\) algorithm based on dynamic programming: \[d_{cum}(i,j)=d(x_i, y_j) + \min\{d_{cum}(i-1,j-1), d_{cum}(i,j-1), d_{cum}(i-1,j)\}\] \[d_{DTW}(x,y) = \min \sqrt{\sum_{k=1}^{K}w_k}\] This distance is equal to the Euclidean distance in the case where \(n=m\) and only the diagonal of the LCM is traversed. DTW does not satisfy the triangle inequality, even when the local distance measure is a metric.
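The following is a minimal base-R sketch of this recursion (assuming the squared difference as local distance, as above), applied to example.series3 and example.series4, which need not be of equal length; it is illustrative rather than an optimised implementation:
dtw_distance <- function(x, y){
  n <- length(x); m <- length(y)
  lcm <- outer(x, y, function(a, b) (a - b)^2)   # local cost matrix
  dcum <- matrix(Inf, nrow = n, ncol = m)        # cumulative cost matrix
  dcum[1, 1] <- lcm[1, 1]
  for(i in 1:n){
    for(j in 1:m){
      if(i == 1 && j == 1) next
      prev <- c(if(i > 1 && j > 1) dcum[i-1, j-1],   # candidate predecessor cells
                if(j > 1) dcum[i, j-1],
                if(i > 1) dcum[i-1, j])
      dcum[i, j] <- lcm[i, j] + min(prev)
    }
  }
  sqrt(dcum[n, m])   # square root of the minimum cumulative cost
}
dtw_distance(example.series3, example.series4)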
A number of modifications to the DTW distance have been proposed in the literature. Derivative Dynamic Time Warping (DDTW) converts the time series into a series of first-order differences. This avoids having a single point of one series map to many points in another series, which can negatively impact the DTW distance.
Weighted Dynamic Time Warping (WDTW) uses a logistic weight function to add penalties to the LCM to favour reduced warping, which brings more balance between shape matching and alignment in time.
Time complexity: \(O(mn)\) for finding the DTW distance between two time series and thus \(O(nmN^2)\) for the entire distance matrix. This quadratic time complexity can make DTW a time-consuming process, hence different methods have been proposed in the literature to speed up the distance measure, with the drawback of deviating from the true DTW distance. These methods fall into three categories: adding constraints, abstracting the data and indexing.
Constraints: limiting the number of elements in the LCM that are evaluated, e.g. the Sakoe-Chiba band and the Itakura parallelogram. Abstracting the data: the DTW algorithm is only run on a reduced representation of the data. Indexing: uses lower bounds to reduce the number of times the DTW algorithm has to be executed.
These methods increase the speed by a constant factor, so the complexity remains \(O(mn)\). FastDTW is an approximation algorithm with a time complexity of \(O(\max(m,n))\). FastDTW reduces the dimensions of the time series by averaging adjacent pairs of points in time, then finds the minimum warping path through local adjustments by considering neighbouring cells in the original LCM. Even though FastDTW has a linear time complexity, accurate approximations come with large constant factors.
- Longest Common Subsequence
Finding the longest subsequence that is common to two or more sequences. Here a subsequence is defined as a sequence that appears in the same relative order, but where the individual elements are not necessarily contiguous. \[
L(i,j) =
\begin{cases}
0 & \text{if } i=0 \text{ or } j=0 \\
1 + L(i-1, j-1) & \text{for } |x_i-y_j|<\epsilon \\
\max\{L(i-1,j),L(i,j-1)\} & \text{otherwise}
\end{cases}
\] This distance only regards similar points, so it is robust to noise and outliers. The time complexity is \(O(mn)\), but often a warping threshold \(\delta\) is added to restrict the matching to a maximum difference in time \((|i-j| < \delta)\), improving the time complexity to \(O((n+m)\delta)\). \[d_{LCSS}(x,y) = \frac{n + m - 2L(n,m)}{n+m}\]
n <- 100
x <- arima.sim(model = list(ar = 0.99), n=n)   # first series, length n
m <- 80
y <- arima.sim(model = list(ar = 0.99), n=m)   # second series, length m (n != m)
eps <- 1      # matching threshold epsilon
delta <- 5    # warping window: only match points with |i - j| < delta
# LCSS dynamic program; entries with i = 0 or j = 0 are treated as 0, as in the recursion above
L <- matrix(0, nrow=n, ncol=m)
for(i in 1:n){
  for(j in 1:m){
    if(abs(x[i] - y[j]) < eps && abs(i-j) < delta){
      L[i,j] <- 1 + ifelse(i > 1 && j > 1, L[i-1, j-1], 0)
    }else{
      L[i,j] <- max(ifelse(i > 1, L[i-1, j], 0), ifelse(j > 1, L[i, j-1], 0))
    }
  }
}
plot(x, col="blue")
lines(y, col="red")
title(paste0("LCSS: ", round(L[n,m], 2)))

heatmap(L, Rowv = NA, Colv = NA)

d <- (n + m - 2*L[n,m])/(m+n)   # LCSS distance as defined above
- Feature-based distances
First extract features from the time series and then measure the distance between these features. Feature-based distances are often applied to obtain a reduction in both dimensionality and noise.
- Principal Component Analysis (PCA)
Dimensionality reduction method to explain the variance-covariance structure of a set of variables through linear combinations. While PCA can be an effective method in general clustering, where it reduces the number of dimensions while retaining most of the information, it is not as effective for time series clustering, since it does not take into account the ordering of the data, making it unable to capture time-evolving and time-shifted patterns.
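As a rough illustration of the idea (keeping k = 5 components is an arbitrary assumption), a PCA feature distance can be sketched with base R's prcomp(), projecting each series onto the leading components and measuring Euclidean distance in the reduced space:
# PCA feature-based distance sketch: rows of example.database2$data are series
k <- 5
pca <- prcomp(example.database2$data)    # principal components over the time points
scores <- pca$x[, 1:k]                   # each series described by k features
diss_pca <- as.matrix(dist(scores))      # Euclidean distance between feature vectors
diss_pca[1:3, 1:3]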
- Discrete Fourier Transform (sine waves)
Transforms the time series from the “time-domain” \(x(t)\) to a “frequency-domain” representation \(X(f)\). The Fourier transform decomposes a time series into all of its constituent cycles (amplitude, offset and rotation speed). The DFT is calculated by the inner product of the time series and a sine wave: \[X(f) = \sum_{t=0}^{n-1} x_t \, e^{-i \frac{2\pi f}{n}t}\] The resulting \(X(f)\) is a vector of \(n\) complex numbers. The inverse DFT transforms a collection of frequencies \(X(f)\) back to the time-domain: \[x(t) =\frac{1}{n}\sum_{f=0}^{n-1} X(f) \, e^{i \frac{2\pi f}{n}t}\] Parseval’s theorem: the DFT preserves the Euclidean distance between two time series. When all the frequencies are used, the Euclidean distance between two Fourier transforms is equal to the Euclidean distance between the original time series. This follows from the fact that the DFT is a linear transformation that is unitary up to a constant factor.
Most of the energy in real-world signals is concentrated in the low frequencies. This is where an advantage of the DFT arises: only a limited number of low frequencies \(q\) is required to obtain a good approximation of the original time series. According to the Nyquist-Shannon sampling theorem, only the first \(\frac{n}{2}\) or fewer frequencies should be used (usually the upper bound \(q=\frac{n}{2}\) is used). Besides the advantage of dimensionality reduction, disregarding the higher frequencies also denoises the approximated time series.
The DFT has a time complexity of \(O(n^2)\) due to the matrix-vector multiplication. This can be improved to \(O(n\log(n))\) by using the Fast Fourier Transform (FFT) algorithm, which computes the same result by factorizing the DFT matrix into a product of sparse factors to avoid redundant calculations. Calculating the distance (Euclidean, for example) between two time series based on the Fourier coefficients has time complexity \(O(q)\), and along with the \(O(n\log(n))\) time required for the DFT, the distance matrix takes \(O(Nn\log(n) + qN^2)\) time.
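A minimal sketch with base R's fft(); the cut-off of q = 10 low-frequency coefficients is an assumption for illustration:
x <- example.series1
y <- example.series2
n <- length(x)
q <- 10                                    # number of low-frequency coefficients kept (assumed)
Xf <- fft(x)[1:q]
Yf <- fft(y)[1:q]
sqrt(sum(Mod(Xf - Yf)^2)) / sqrt(n)        # approximate distance on truncated coefficients
# With all n coefficients, Parseval's theorem recovers the Euclidean distance
all.equal(sqrt(sum(Mod(fft(x) - fft(y))^2)) / sqrt(n),
          TSdist::EuclideanDistance(x, y))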
- Discrete Wavelet Transform (wavelets)
DWT is a dimensionality reduction method that also reduces noise. It decomposes a time series into a set of basis functions called wavelets. A wavelet is a rapidly decaying, wave-like oscillation with zero mean and finite duration. Wavelets are defined by two functions: the wavelet function (mother) \(\psi\) and the scaling function (father) \(\varphi\). The mother defines the basic shape, and the father the scale (frequency).
The DWT is obtained by successively passing a time series through high-pass and low-pass filters, which produces detail and approximation coefficients for different levels of decomposition.
Main advantage over DFT: captures frequency and location in time.
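A minimal hand-rolled sketch of a single level of the Haar DWT (dedicated wavelet packages provide full multi-level implementations); the low-pass filter yields the approximation coefficients and the high-pass filter the detail coefficients:
haar_dwt <- function(x){
  n <- length(x) - length(x) %% 2          # drop the last point if the length is odd
  odd  <- x[seq(1, n, by = 2)]
  even <- x[seq(2, n, by = 2)]
  list(approximation = (odd + even) / sqrt(2),   # low-pass filter
       detail        = (odd - even) / sqrt(2))   # high-pass filter
}
# Distance between two series based on their approximation coefficients only
sqrt(sum((haar_dwt(example.series1)$approximation -
          haar_dwt(example.series2)$approximation)^2))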
- Symbolic Aggregate Approximation (SAX) (strings)
- Edit-based distances
Based on the minimum number of operations that are required to transform one series into the other.
- Structure-based distances
Notes: Compare higher-level structures, obtained by modelling or compressing the time series. The running time and the obtained clusters for edit- and structure-based distances tend to be similar to or worse than those of shape- and feature-based distance measures. Lock-step distance measures are often outperformed by elastic ones, due to the sensitivity of lock-step measures to noise, scale and time shifts.