Introduction

This document summarises my learning path in time series clustering. Over the next months, the PhD thesis will focus on learning the state of the art in this field and applying that knowledge to specific industrial cases. We focus on this field because, in some cases, the data must be summarised in order to understand it properly and to propose more advanced models.

Specifically, the thesis is oriented towards studying data with an inherent cyclic behavior. Data of this nature is generated by a cyclic process and is captured in the form of a single univariate or multivariate time series. Cycles are captured sequentially, one after the other, yielding a very long, almost unbounded time series, which may be intractable in computational and memory terms. Moreover, the length of each cycle is not constant.

In this context, we are interested in studying the variability of the cyclical process, and detecting the latent states in which the process can work.

At this moment, the projects that could benefit from applying time series clustering techniques are:

Note on cycle definition: A seasonal pattern exists when a series is influenced by seasonal factors. Seasonality is always of a fixed and known period. A cyclic pattern exists when data exhibit rises and falls that are not of fixed period.

The difference between seasonal and cyclical behavior has to do with how regular the period of change is. Seasonal behavior is strictly regular, meaning there is a precise amount of time between the peaks and troughs of the data. For instance, temperature exhibits seasonal behavior.

Cyclical behavior on the other hand can drift over time because the time between periods isn’t precise. For example, the stock market tends to cycle between periods of high and low values, but there is no set amount of time between those fluctuations.

Resources

R Packages

  • Time series analysis: TSclust, TSdist, TSrepr, BNPTSclust, pdc, dtwclust, feasts, tsfeatures, rucrdtw, IncDTW, fable
  • Not TS-specific: depmixS4, NbClust, cluster, clusterSim

Python Libraries

  • Clustering: similaritymeasures, hctsa, khiva, tsfresh, tslearn, seglearn, stumpy, matrixprofile-ts, tf, pyts, sktime, pyclustering, pyflux
  • Time series: flint, cesium, statsmodels, prophet, nitime, pastas, pysf
  • Support: arrow, featuretools, tspreprocess, traces

Databases

InfluxDB, Timescale, Prometheus, OpenTSDB, KairosDB, Heroic, BTrDB, Vaultaire, Beringei, Atlas, Khronus, HawkularMetrics, Blueflood, Newts, Akumuli

Datasets

  • Time Series Classification Repository, by UEA & UCR (2018)
  • KEEL dataset repository, by J. Alcalá-Fdez (2011)

Time series clustering

Time Series Taxonomy

Definitions

  • Univariate TS
  • Multivariate TS
  • Subsequence TS

Main Techniques for TS

  • TS segmentation
  • TS motif detection
  • TS subsequence matching
  • TS anomaly detection
  • TS classification
  • TS forecasting
  • TS clustering

Clustering definition:

Roelofsen (2018): clustering is the practice of finding hidden patterns or similar groups in data.

Marques (2018): Clustering is an unsupervised data mining technique whose goal is to form homogeneous groups, or clusters, of objects with minimum inter-cluster and maximum intra-cluster similarity. It is an exploratory technique in time-series visualization, in which clusters are constructed considering each entire series as a whole. Three main choices must be made up front: a) a distance measure (to quantify dissimilarity), b) a prototype (which summarizes the characteristics of all series in a cluster), c) a clustering algorithm (most commonly partitional or hierarchical). Finally, results are evaluated with cluster validity indices (CVIs).

Roelofsen (2018): Cluster analysis can be divided into three different parts: a) determining a measure to quantify the similarity between observations b) choosing the method to obtain the clustering c) selecting the desired number of clusters

Main Strategies for Clustering

Whole time series clustering

Roelofsen (2018): whole (complete) time series of equal length are compared

Subsequence time series clustering

Roelofsen (2018): a given subsequence is compared with the subsequences of other whole time series

Time point clustering

Distance / Similarity Measures

  • Notation based on Roelofsen (2018): \(d(x,y)\) denotes the distance between time series \(x = (x_1, ..., x_n)\) and \(y = (y_1, ..., y_m)\). Time series \(x\) and \(y\) can be seen as numerical vectors with dimensions \(n\) and \(m\), such that \(x \in \mathbb{R}^n\) and \(y \in \mathbb{R}^m\), where \(n\) and \(m\) can differ in value. The value \(d(x,y) \geq 0\) is a single non-negative number that depends on the distance measure used. The lower the value of \(d(x,y)\), the closer the two observations are according to the chosen distance measure. The total number of time series is indicated by \(N\).

  • Time complexity by Roelofsen (2018) Most clustering methods require computing the distance or dissimilarity matrix, which contains the distance between each pair of observations. When determining the distance matrix for N different observations, a total of \(\frac{N(N-1)}{2}\) distances have to be calculated. Thus, the required calculation time for determining the distance matrix grows with \(O(N^2)\), and this is without taking into account the time complexity of the individual distance measures. Due to the quadratic time complexity, determining the distance matrix for large sets of observations (\(\gg 1000\)) can be a time consuming operation.

Taxonomy of distance measures, by Roelofsen (2018)

suppressMessages(library(dplyr))
suppressMessages(library(TSdist))
data("example.database")
data("example.database2")
data("example.database3")
data("example.series1")
data("example.series2")
data("example.series3")
data("example.series4")

df <- t(example.database) %>% 
  as.data.frame()
df %>% 
  dplyr::mutate(t = 1:n()) %>% 
  tidyr::gather(key, value, -t) %>% 
  ggplot2::ggplot()+
  ggplot2::geom_line(ggplot2::aes(x=t, y=value, color=key), show.legend = F)+
  ggplot2::facet_wrap(.~key)

N <- length(example.database2$classes)
labels <-  data.frame(key=1:N, label=example.database2$classes)
df <- t(example.database2$data) %>% 
  as.data.frame() %>% 
  `colnames<-`(1:N)

df %>% 
  dplyr::mutate(t = 1:n()) %>% 
  tidyr::gather(key, value, -t) %>% 
  merge(labels) %>% 
  ggplot2::ggplot()+
  ggplot2::geom_line(ggplot2::aes(x=t, y=value, group=key, color=label), alpha=0.5, show.legend = F)+
  ggplot2::facet_wrap(.~label)

  1. Shape-based distances

Compare the overall shape of time series based on the actual (scaled) values of the time series.

  1. Lock-step measures (\(n = m\))

These distance measures require both time series to be of equal length (\(n = m\)) and compare time point \(i\) of time series \(x\) with the same point \(i\) of time series \(y\).

  • Minkowski distance: the \(L_{p}\)-norm of the difference between two vectors of equal length (\(n = m\)). \[d_{min} (x,y) = \bigg(\sum_{i=1}^{n} |x_i - y_i|^p \bigg)^{1/p}\] It is the generalization of the commonly used Euclidean distance (\(p=2\)), Manhattan distance (\(p=1\)) and Chebyshev distance (\(p= \infty\)). The time complexity of the Minkowski distance is \(O(n)\), so determining the distance matrix with this measure takes \(O(nN^2)\) time.

\[d_{euc} (x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\]
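As a quick base-R sketch (no package assumed; the function name is mine), the Minkowski family can be written directly from the definition and compared against the TSdist result below:

minkowski_dist <- function(x, y, p = 2) {
  if (is.infinite(p)) return(max(abs(x - y)))  # Chebyshev (p = Inf)
  sum(abs(x - y)^p)^(1/p)                      # p = 1: Manhattan, p = 2: Euclidean
}
minkowski_dist(example.series1, example.series2, p = 2)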

x <- example.series1
y <- example.series2
data.frame(x, y) %>% 
  dplyr::mutate(t = 1:n()) %>% 
  tidyr::gather(key, value, -t) %>% 
  ggplot2::ggplot()+
  ggplot2::geom_line(ggplot2::aes(x=t, y=value, color=key))

TSdist::EuclideanDistance(x, y)
## [1] 33.79881
diss <- TSdist::TSDatabaseDistances(example.database2$data, distance="euclidean", diag=T, upper=T) %>% 
  as.matrix()
diss %>% 
  reshape::melt() %>% 
  ggplot2::ggplot()+
  ggplot2::geom_tile(ggplot2::aes(x = X1, y = X2, fill = value))+
  ggplot2::scale_fill_viridis_c()

plot_diss_matrix <- function(diss, data){
  suppressMessages(library(plotly))
  diss[lower.tri(diss)] <- NA
  plotly::plot_ly(hoverinfo = "none") %>% 
    plotly::add_heatmap(z = diss, customdata = list(data)) %>% 
    htmlwidgets::onRender(readLines("tx-inset-plot.js"))
}

plot_diss_matrix(diss, example.database2$data)

\[d_{man} (x,y) = \sum_{i=1}^{n} |x_i - y_i|\] \[d_{che} (x,y) = \max_{i=1,...,n} |x_i - y_i|\]

  • Pearson correlation distance

Takes into account the linear association between two vectors of variables. The Pearson correlation coefficient is defined as: \[ \rho(x,y) = \frac{Cov(x,y)}{\sigma_x\sigma_y} = \frac{\mathbb{E}[(x-\mu_x)(y-\mu_y)]}{\sigma_x\sigma_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\] where \(\mu_x\) and \(\mu_y\) are the means of \(x\) and \(y\) and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\), respectively. The values of \(\rho\) lie within the range [-1,1], where \(\rho = 1\) indicates a perfect positive relationship between \(x\) and \(y\), \(\rho = -1\) indicates a perfect negative relationship between \(x\) and \(y\), and \(\rho = 0\) indicates no relationship between the two variables. The Pearson correlation distance is defined as: \[d_{cor}(x,y) = 1-\rho(x,y)\] This distance measure can take values in the range [0,2]. The time complexity is \(O(n)\) and thus determining the distance matrix with this measure takes \(O(nN^2)\) time.
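As a minimal base-R sketch (no TS-specific package assumed; the helper name is mine), the correlation distance between the two example series used above:

# Pearson correlation distance: d_cor(x, y) = 1 - rho(x, y), with values in [0, 2]
cor_dist <- function(x, y, method = "pearson") 1 - cor(x, y, method = method)
cor_dist(example.series1, example.series2)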

Alternative correlation measures include Spearman’s Rank and Kendall’s Tau correlation coefficients. These coefficients indicate correlation based on rank, whereas the Pearson coefficient is based on a linear relationship between two vectors. This makes Spearman’s Rank and Kendall’s Tau less sensitive to noise and outliers. However, this comes with an increase in time complexity: \(O(n\log(n))\) for Spearman’s Rank and \(O(n^2)\) for Kendall’s Tau.

Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. \[\rho_{spearman} = \frac{Cov(rg_x,rg_y)}{\sigma_{rg_x}\sigma_{rg_y}}\] where \(rg_x\) is the rank of the vector \(x\), which requires sorting the vector.

In the same way, the Kendall correlation between two variables will be high when observations have a similar rank (i.e. relative position of the observations within the variable) in both variables. \[\tau = \frac{\text{concordant pairs} - \text{discordant pairs}}{\binom{n}{2}}\] A pair of observations \((x_i, y_i)\) and \((x_j, y_j)\) is said to be concordant if the ranks of both elements agree. In order to count the number of concordant and discordant pairs, it is necessary to compare all pairs of observations, hence the \(O(n^2)\) time complexity.

plot_coefficients <- function(x,y){
  plot(x, y)
  title(paste0(" Pearson: ", round(cor(x, y, method="pearson"), 2),
               "\nSpearman: ", round(cor(x, y, method="spearman"), 2),
               "\nKendall: ", round(cor(x, y, method="kendall"), 2)))
}
par(mfrow=c(1,3))
x <- runif(100)
plot_coefficients(x,x + rnorm(100, 0, 0.1))
plot_coefficients(x,100*(x-0.5)^3 + x + rnorm(100, 0, 1))
plot_coefficients(x,10*x^2 + x + rnorm(100, 0, 0.1))

par(mfrow=c(1,1))

Another alternative is a distance based on autocorrelation, which accounts for lags in time and has a time complexity of \(O(n\log(n))\).
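An unweighted version of such an autocorrelation-based distance can be sketched in base R (TSdist and TSclust provide ready-made variants; the helper below is mine):

# Euclidean distance between the first lag.max autocorrelation coefficients
acf_dist <- function(x, y, lag.max = 20) {
  ax <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]  # drop lag 0 (always 1)
  ay <- acf(y, lag.max = lag.max, plot = FALSE)$acf[-1]
  sqrt(sum((ax - ay)^2))
}
acf_dist(example.series1, example.series2)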

  2. Elastic measures (\(n \neq m\))

Elastic measures allow one-to-many and one-to-none matching; they are able to warp in time and are more robust to outliers. The main disadvantage, however, is that elastic distance measures generally come with an increase in time complexity.

  • Dynamic time warping

DTW warps two sequences \(x\) and \(y\) non-linearly in time in order to cope with time deformations and varying speeds in time-dependent data. When determining the DTW distance between two time series, first an \((n \times m)\) local cost matrix (LCM) is calculated, where element \((i,j)\) contains the distance between \(x_i\) and \(y_j\). This distance is usually defined as the quadratic difference: \(d(x_i, y_j) = (x_i - y_j)^2\). Next, a warping path \(W = w_1, w_2, ..., w_K\) is determined, where \(\max(n,m) \leq K \leq m+n-1\). This path traverses the LCM under three constraints:

  1. Boundary condition: path must start and end in the diagonal corners \(w_1=(1,1)\), \(w_k=(n,m)\).
  2. Continuity: only adjacent elements in the matrix are allowed for steps in the path.
  3. Monotonicity: subsequent steps in the path must be monotonically spaced in time.

The total distance for path \(W\) is obtained by summing the individual elements (distances) of the LCM that the path traverses. To obtain the DTW distance, the path with minimum total distance is required. This path can be obtained by an \(O(nm)\) algorithm based on dynamic programming: \[d_{cum}(i,j)=d(x_i, y_j) + \min\{d_{cum}(i-1,j-1), d_{cum}(i,j-1), d_{cum}(i-1,j)\}\] \[d_{DTW}(x,y) = \min_{W} \sqrt{\sum_{k=1}^{K}w_k}\] This distance is equal to the Euclidean distance in the case where \(n=m\) and only the diagonal of the LCM is traversed. DTW does not satisfy the triangle inequality, even when the local distance measure is a metric.
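A minimal base-R sketch of this dynamic program (unconstrained, quadratic time; the function name is mine):

# Cumulative cost over the LCM with local cost d(x_i, y_j) = (x_i - y_j)^2
dtw_dist <- function(x, y) {
  x <- as.numeric(x); y <- as.numeric(y)
  n <- length(x); m <- length(y)
  lcm <- outer(x, y, function(a, b) (a - b)^2)   # local cost matrix
  D <- matrix(Inf, n + 1, m + 1); D[1, 1] <- 0   # padded cumulative cost matrix
  for (i in 1:n) {
    for (j in 1:m) {
      D[i + 1, j + 1] <- lcm[i, j] + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  sqrt(D[n + 1, m + 1])   # d_DTW as defined above
}
dtw_dist(example.series1, example.series2)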

A number of modifications to the DTW distance have been proposed in the literature. Derivative Dynamic Time Warping (DDTW) converts the time series into a series of first-order differences. This avoids having a single point of one series map to many points in another series, which can negatively impact the DTW distance.

Weighted Dynamic Time Warping (WDTW) uses a logistic weight function to add penalties to the LCM to favour reduced warping, which brings more balance between shape matching and alignment in time.

Time complexity: \(O(mn)\) for finding the DTW distance between two time series, and thus \(O(nmN^2)\) for the entire distance matrix. This quadratic time complexity can make DTW a time-consuming process, hence different methods have been proposed in the literature to speed up the distance measure, with the drawback of deviating from the true DTW distance. These methods fall into three categories: adding constraints, abstracting the data and indexing.

Constraints: limiting the number of elements in the LCM that are evaluated, e.g. the Sakoe-Chiba band and the Itakura parallelogram. Abstracting the data: the DTW algorithm is only run on a reduced representation of the data. Indexing: lower bounds are used to reduce the number of times the DTW algorithm has to be executed.

These methods increase the speed by a constant factor, thus remaining \(O(mn)\). FastDTW is an approximation algorithm with a time complexity of \(O(\max(m,n))\). FastDTW reduces the dimensions of the time series by averaging adjacent pairs of points in time, then finds the minimum warping path through local adjustments by considering neighbouring cells in the original LCM. Even though FastDTW has a linear time complexity, accurate approximations come with large constant factors.
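As an illustration of the constrained approach, a sketch assuming the dtw_basic() helper of the dtwclust package listed above, where window.size sets a Sakoe-Chiba style band around the diagonal:

suppressMessages(library(dtwclust))
# Unconstrained vs. band-constrained DTW between the two example series
dtwclust::dtw_basic(example.series1, example.series2)
dtwclust::dtw_basic(example.series1, example.series2, window.size = 10)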

  • Longest Common Subsequence

Finding the longest subsequence that is common to two or more sequences. Here a subsequence is defined as a sequence that appears in the same relative order, but where the individual elements are not necessarily contiguous. \[ L(i,j) = \begin{cases} 0 & \text{if } i=0 \text{ or } j=0 \\ 1 + L(i-1, j-1) & \text{for } |x_i-y_j|<\epsilon \\ \max\{L(i-1,j),L(i,j-1)\} & \text{otherwise} \end{cases} \] This distance only regards similar points, so it is robust to noise and outliers. The time complexity is \(O(mn)\), but often a warping threshold \(\delta\) is added to restrict the matching to a maximum difference in time \((|i-j| < \delta)\), improving the time complexity to \(O((n+m)\delta)\). \[d_{LCSS}(x,y) = \frac{n + m - 2L(n,m)}{n+m}\]

n <- 100
x <- arima.sim(model = list(ar = 0.99), n=n)
m <- 80
y <- arima.sim(model = list(ar = 0.99), n=m)
eps <- 1    # matching threshold epsilon
delta <- 5  # warping threshold delta
# L[i+1, j+1] stores L(i, j); the first row/column hold the base case L(0, .) = L(., 0) = 0
L <- matrix(0, nrow=n+1, ncol=m+1)
for(i in 1:n){
  for(j in 1:m){
    if(abs(x[i] - y[j]) < eps && abs(i-j) < delta){
      L[i+1, j+1] <- 1 + L[i, j]
    }else{
      L[i+1, j+1] <- max(L[i, j+1], L[i+1, j])
    }
  }
}
plot(x, col="blue")
lines(y, col="red")
title(paste0("LCSS: ", L[n+1, m+1]))

heatmap(L, Rowv = NA, Colv = NA)

d <- (n + m - 2*L[n+1, m+1])/(m+n)

  2. Feature-based distances

First, features are extracted from the time series, and then the distance between these features is measured. Feature-based distances are often applied to obtain a reduction in both dimensionality and noise.

  • Principal Component Analysis (PCA)

Dimensionality reduction method to explain the variance-covariance structure of a set of variables through linear combinations. While PCA can be an effective method in general clustering, where it reduces the number of dimensions while retaining most of the information, it is not as effective for time series clustering, since it does not take into account the ordering of the data, making it unable to capture time-evolving and time-shifted patterns.
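Still, as a sketch of the feature-based idea (reusing example.database2, whose rows are whole series), the first few principal component scores can serve as low-dimensional features before computing an ordinary distance matrix:

# PCA on the series-by-time matrix; keep the first two principal components as features
pca <- prcomp(example.database2$data, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]
diss_pca <- dist(scores)   # Euclidean distance in the reduced feature space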

  • Discrete Fourier Transform (sine waves)

Transforms the time series from the “time-domain” \(x(t)\) to a “frequency-domain” representation \(X(f)\). The Fourier transform decomposes a time series into all its different cycles (amplitude, offset and rotation speed). The DFT is calculated by the inner product of the time series and a sine wave: \[X(f) = \sum_{t=0}^{n-1} x_t \, e^{-i \frac{2\pi f}{n}t}\] The resulting \(X(f)\) is a vector of \(n\) complex numbers. The inverse DFT transforms a collection of frequencies \(X(f)\) back to the time-domain: \[x(t) =\frac{1}{n}\sum_{f=0}^{n-1} X(f) \, e^{i \frac{2\pi f}{n}t}\] Parseval’s theorem: the DFT preserves the Euclidean distance between two time series. When all the frequencies are used, the Euclidean distance between two Fourier transforms is equal to the Euclidean distance between the original time series. This follows from the fact that the DFT is a linear, orthogonal transformation (up to scaling).

Most of the energy in real-world signals is concentrated in the low frequencies. This is where an advantage of the DFT arises: only a limited number of low frequencies \(q\) is required to obtain a good approximation of the original time series. According to the Nyquist-Shannon sampling theorem, only the first \(\frac{n}{2}\) or fewer frequencies should be used (usually the upper bound \(q=\frac{n}{2}\) is used). Besides the advantage of dimensionality reduction, disregarding the higher frequencies also denoises the approximation of the time series.

The DFT has a time complexity of \(O(n^2)\) due to the matrix-vector multiplication. This can be improved to \(O(n\log(n))\) by using Fast Fourier Transform (FFT) algorithms, which compute the same result by factorizing the DFT matrix into a product of sparse factors to avoid redundant calculations. Calculating the distance (Euclidean, for example) between two time series based on the Fourier coefficients has time complexity \(O(q)\), and along with the \(O(n\log(n))\) time required for the DFT, the distance matrix takes \(O(Nn\log(n) + qN^2)\).
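A base-R sketch of this approximation on one of the example series, keeping only the q lowest frequencies (the index bookkeeping below assumes an even series length and q well below n/2):

x <- as.numeric(example.series1)
n <- length(x)
q <- 10
X <- fft(x)                                   # DFT via the FFT, O(n log n)
X_q <- X
X_q[(q + 2):(n - q)] <- 0                     # keep DC + the q lowest frequencies (and conjugates)
x_approx <- Re(fft(X_q, inverse = TRUE) / n)  # inverse DFT back to the time domain
plot(x, type = "l"); lines(x_approx, col = "red")
# Parseval check (the 1/n accounts for R's unnormalised fft): both lines print the same value
sqrt(sum((as.numeric(example.series1) - as.numeric(example.series2))^2))
sqrt(sum(Mod(fft(as.numeric(example.series1)) - fft(as.numeric(example.series2)))^2) / n)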

  • Discrete Wavelet Transform (wavelets)

DWT is a dimensionality reduction method that also reduces noise. It decomposes a time series into a set of basis functions called wavelets. A wavelet is a rapidly decaying, wave-like oscillation with zero mean and finite duration. Wavelets are defined by two functions: the wavelet function (mother) \(\psi\) and the scaling function (father) \(\varphi\). The mother defines the basic shape, and the father the scale (frequency).

The DWT is obtained by successively passing a time series through high-pass and low-pass filters, which produces detail and approximation coefficients for different levels of decomposition.

Main advantage over DFT: captures frequency and location in time.
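A hand-rolled single level of the Haar DWT (a minimal sketch; in practice one of the wavelet packages would be used) makes the high-pass/low-pass split explicit:

x <- as.numeric(example.series1)
x <- x[1:(2 * floor(length(x) / 2))]                 # even length so points can be paired
pairs <- matrix(x, nrow = 2)                         # adjacent pairs as columns
approx_coef <- (pairs[1, ] + pairs[2, ]) / sqrt(2)   # low-pass: scaling (father) coefficients
detail_coef <- (pairs[1, ] - pairs[2, ]) / sqrt(2)   # high-pass: wavelet (mother) coefficients
par(mfrow = c(1, 2))
plot(approx_coef, type = "l", main = "Level-1 approximation")
plot(detail_coef, type = "h", main = "Level-1 detail")
par(mfrow = c(1, 1))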

  • Symbolic Aggregate Approximation (SAX) (strings)

  3. Edit-based distances

Based on the minimum number of operations required to transform one series into the other.

  4. Structure-based distances

Notes: compare higher-level structures, obtained by modelling or compressing the time series. The running time and the obtained clusters for edit- and structure-based distances tend to be similar or worse compared to shape- and feature-based distance measures. Lock-step distance measures are often outperformed by elastic ones, due to their sensitivity to noise, scale and time shifts.

Dimensionality Reduction Techniques

PCA

Clustering Method

  • Partitional
  • Density-based
  • Hierarchical
  • Model-based

Prototypes

  • Mean
  • DBA (DTW barycenter averaging)
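A sketch of how method and prototype combine, assuming the tsclust() interface of the dtwclust package listed above (DTW distance with DBA centroids; k is taken from the known classes of example.database2):

suppressMessages(library(dtwclust))
k <- length(unique(example.database2$classes))
# Partitional clustering with DTW distance and DBA (DTW barycenter averaging) prototypes
pc <- tsclust(example.database2$data, type = "partitional", k = k,
              distance = "dtw_basic", centroid = "dba", seed = 1)
table(cluster = pc@cluster, class = example.database2$classes)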

Evaluation

  • Internal: Dunn index, silhouette
  • External: labels, multiclass confusion matrix
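As an internal-validation sketch, assuming the cluster package listed above and reusing the Euclidean dissimilarity matrix diss computed earlier:

# Average silhouette width of a hierarchical clustering cut into k clusters
k <- length(unique(example.database2$classes))
hc <- hclust(as.dist(diss), method = "average")
cl <- cutree(hc, k = k)
sil <- cluster::silhouette(cl, as.dist(diss))
summary(sil)$avg.width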

# https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
# parameters
b <- 100 #batch size
t <- 100 #iterations

# data
df <- clusteringdatasets::D31
X <- apply(df[,c(1,2)], 2, as.numeric)
y <- df$label
k <- length(unique(df$label))

blobs <- clusteringdatasets::make_blobs(n_samples = 1000000, n_features = 2, centers = 10)
X <- blobs$samples
y <- blobs$labels
k <- length(unique(blobs$labels))

n <- nrow(X)
C <-  X[sample(1:n, size=k),]
distance <- function(c, x){
  return(sum((c-x)^2))
}
nearest_center <- function(C, x){
  error <- apply(C, 1, function(c) distance(c, x))
  return (which.min(error))
}
v <- numeric(n)
d <- numeric(n)
for (j in 1:t) {
  M <- sample(1:n, size=b)
  for (i in M) {
    d[i] <- nearest_center(C, X[i, ])
  }
  for (i in M) {
    c <- d[i]
    v[c] <- v[c] + 1
    lr <- 1/v[c]
    C[c, ] <- (1-lr)*C[c, ] + lr*X[i, ]
  }
}

s <- sample(1:n, size = 1000)
plot(X[s, ], col=rainbow(k)[y[s]])
points(C, pch=21, cex=1, bg="black")

# ARIMA
x <- arima.sim(model = list(ar=0.999), n=100)
plot(x)

# AR
n <- 100
x <- numeric(n)
x[1] <- 0
for(i in 2:n){
  x[i] <- 0.9*x[i-1] + rnorm(1, 0, 0.1)
}
plot(x); lines(x)

# MA(1): x_t = eps_t + theta * eps_{t-1}, with white noise eps_t
n <- 100
theta <- 0.9
eps <- rnorm(n, 0, 0.1)
x <- numeric(n)
x[1] <- eps[1]
for(i in 2:n){
  x[i] <- eps[i] + theta*eps[i-1]
}
plot(x); lines(x)

Roelofsen, P. (2018). Time series clustering. Vrije Universiteit Amsterdam, Amsterdam.