To start, choose three main parameters.

  1. Distance measure (quantifies dissimilarity between series)
  2. Prototype (summarizes the characteristics of all series in a cluster)
  3. Clustering algorithm (most commonly partitional or hierarchical)

To finish, evaluate results.

  1. Cluster validity indices (CVI)

I. Distance measures

  1. Euclidean distance
  2. Dynamic time warping (DTW) distance
  3. Global alignment kernel (GAK) distance

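\[ DTW(x,y) = \min_{\pi \in \Lambda (n,m)} D_{x,y}(\pi) \] (1.1)

\[ D_{x,y}(\pi) = \sum_{i=1}^{|\pi|} \varphi(x_{\pi_1 (i)}, y_{\pi_2 (i)}) \] (1.2)

\[ \kappa_{GA}(x,y) = \sum_{\pi \in \Lambda (n,m)} \prod_{i=1}^{|\pi|} K(x_{\pi_1 (i)}, y_{\pi_2 (i)}) \] (2)

Here \Lambda(n,m) is the set of admissible alignments (warping paths) \pi between two series of lengths n and m, \varphi is a local cost between aligned points, and K is a local kernel. DTW keeps only the cheapest alignment, while the GAK sums over all of them.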
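
Equations (1.1), (1.2), and (2) can both be evaluated with a dynamic-programming table, which the following base R sketch illustrates. It is not the dtwclust implementation: the function names `dtw_distance` and `gak_kernel`, the absolute-difference local cost, the Gaussian local kernel with bandwidth `sigma`, and the simple `|i - j| <= window` band are all assumptions made for illustration.

```r
# DTW: minimum over all alignments of the summed local cost (eqs. 1.1 and 1.2).
# `window` is an assumed, simple band constraint: cells with |i - j| > window are skipped.
dtw_distance <- function(x, y, window = Inf) {
  n <- length(x)
  m <- length(y)
  acc <- matrix(Inf, n + 1, m + 1)   # accumulated cost, padded with one extra row/column
  acc[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (abs(i - j) > window) next
      local_cost <- abs(x[i] - y[j])  # assumed local cost (phi)
      acc[i + 1, j + 1] <- local_cost +
        min(acc[i, j], acc[i, j + 1], acc[i + 1, j])
    }
  }
  acc[n + 1, m + 1]
}

# GAK: sum over all alignments of the product of local kernels (eq. 2).
# Same table as DTW, with (min, +) replaced by (+, *).
gak_kernel <- function(x, y, sigma = 1) {
  n <- length(x)
  m <- length(y)
  acc <- matrix(0, n + 1, m + 1)
  acc[1, 1] <- 1
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      k_local <- exp(-(x[i] - y[j])^2 / (2 * sigma^2))  # assumed Gaussian local kernel (K)
      acc[i + 1, j + 1] <- k_local *
        (acc[i, j] + acc[i, j + 1] + acc[i + 1, j])
    }
  }
  acc[n + 1, m + 1]
}

x <- c(0, 1, 2, 1, 0)
y <- c(0, 0, 1, 2, 1, 0)   # same shape as x, delayed by one step
z <- c(0, 0, 1, 2, 1)      # delayed and truncated to the length of x

dtw_unconstrained <- dtw_distance(x, z)
dtw_diagonal_only <- dtw_distance(x, z, window = 0)  # no warping allowed
dtw_shifted       <- dtw_distance(x, y)
gak_xy            <- gak_kernel(x, y)
```

With `window = 0` and equal lengths the alignment is forced onto the diagonal, so the result reduces to a point-by-point sum of absolute differences; widening the window lets DTW absorb the time shift.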

II. Prototype

  1. Mean or median
  2. Partition around medoids (PAM)
  3. DTW barycenter averaging
  4. Shape extraction
  5. Fuzzy-based prototypes
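
To make the first two prototype types concrete, here is a minimal base R sketch on four made-up series of equal length. The data, the variable names, and the use of Euclidean distance are assumptions for illustration; DTW barycenter averaging, shape extraction, and fuzzy prototypes are not shown.

```r
# Four toy series of equal length (hypothetical data); "d" is an outlier
series <- list(
  a = c(0, 1, 2, 1, 0),
  b = c(0, 1, 2, 2, 0),
  c = c(0, 2, 4, 2, 0),
  d = c(5, 5, 5, 5, 5)
)
series_mat <- do.call(rbind, series)   # one series per row

# Mean prototype: point-wise average, usually not one of the observed series
mean_prototype <- colMeans(series_mat)

# Medoid prototype (as used by PAM): the observed series with the
# smallest total distance to all other series in the cluster
dist_mat <- as.matrix(dist(series_mat))          # Euclidean, for simplicity
medoid_name <- names(which.min(rowSums(dist_mat)))
medoid_prototype <- series[[medoid_name]]
```

The mean is pulled toward the outlier, whereas the medoid is always an actual member of the cluster, which is one reason medoids pair naturally with distances such as DTW where a point-wise average is not meaningful.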

III. Time-series clustering algorithms

  1. Hierarchical clustering
  2. Partitional clustering
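
The difference between the two families can be sketched with base R on a small made-up data set. This uses Euclidean distance with `hclust` and `kmeans` from the stats package as stand-ins, not the dtwclust routines; the toy series and the seed are assumptions for illustration.

```r
# Six toy series (hypothetical data): three at a low level, three at a high level
series <- rbind(
  low1  = c(0, 1, 0, 1),
  low2  = c(0, 1, 0, 2),
  low3  = c(1, 1, 0, 1),
  high1 = c(10, 11, 10, 11),
  high2 = c(10, 12, 10, 11),
  high3 = c(11, 11, 10, 11)
)

# Hierarchical: build the full merge tree from pairwise distances,
# then cut it afterwards at the desired number of clusters
d <- dist(series)
tree <- hclust(d, method = "complete")
hier <- cutree(tree, k = 2)

# Partitional: fix k up front and iteratively reassign series to centroids;
# the result depends on the random start, hence the seed and several restarts
set.seed(390)
part <- kmeans(series, centers = 2, nstart = 5)$cluster

# Cross-tabulate the two partitions (cluster labels themselves are arbitrary)
agreement <- table(hierarchical = hier, partitional = part)
```

A hierarchical result can be re-cut at any k without recomputation, while a partitional result has to be refit for each k.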

IV. Cluster evaluation with CVIs
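
As one example of an internal CVI, the average silhouette width can be computed by hand and compared across candidate values of k. The sketch below is base R only, with made-up data and a hypothetical helper `mean_silhouette`; dtwclust provides its own `cvi()` function (see the references), which is not used here.

```r
# Average silhouette width for a distance object `d` and cluster labels `cl`
mean_silhouette <- function(d, cl) {
  d <- as.matrix(d)
  widths <- sapply(seq_along(cl), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE
    if (!any(same)) return(0)   # singleton clusters score 0 by convention
    a <- mean(d[i, same])       # mean distance to the own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(widths)
}

# Six toy series (hypothetical data) forming two obvious groups
series <- rbind(
  low1  = c(0, 1, 0, 1),
  low2  = c(0, 1, 0, 2),
  low3  = c(1, 1, 0, 1),
  high1 = c(10, 11, 10, 11),
  high2 = c(10, 12, 10, 11),
  high3 = c(11, 11, 10, 11)
)

d <- dist(series)
tree <- hclust(d, method = "complete")

# Score each candidate k and keep the one with the largest average silhouette
ks <- 2:4
scores <- sapply(ks, function(k) mean_silhouette(d, cutree(tree, k = k)))
names(scores) <- ks
best_k <- ks[which.max(scores)]
```

Values close to 1 indicate compact, well-separated clusters. The silhouette is to be maximized; other indices are to be minimized, so the direction of each index has to be checked before comparing values of k.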


The code below demonstrates time-series clustering with the R package dtwclust by Alexis Sarda-Espinosa. The package documentation was also the main reference for most of the preceding notes and figures, and the R script is adapted from exercises in the dtwclust vignette. dtwclust also has notable dependencies on the flexclust and dtw packages.

# synthetically generated control charts (UCI synthetic control data set)
library(tidyr)     # gather()
library(dtwclust)  # zscore(), tsclust()
library(dplyr)     # %>%
library(ggplot2)   # plotting

df <- read.table("http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data", 
                 header = FALSE)
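# wide to long: keep the first 50 rows and the first 20 columns, so that
# each column (V1..V20) becomes one sequence of 50 values identified by `key`
# note: in the raw file each row is one control chart of 60 observations,
# so this subset reads the data column-wise rather than chart-wise
df_long <- gather(df[1:50, 1:20])

# make a timepoint column (1..50 within each of the 20 sequences)
df_long$time <- rep(1:50, 20)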

# plot by key
df_long %>%
  ggplot(aes(x = time, y = value, color = key)) +
  geom_line(size = 0.2) +
  ggtitle("Control chart sequences") +
  facet_wrap(~ key, scales = "free_x", nrow = 2)

# convert to a named list with one numeric vector per sequence
df_list <- as.list(utils::unstack(df_long, value ~ key))

# z-normalize each sequence
df_list_z <- dtwclust::zscore(df_list)

# hierarchical clustering with a 10% window size (5 of 50 time points) for k = 2 to 20 clusters
cluster_dtw_h <- list()
for (i in 2:20)
{
  cluster_dtw_h[[i]] <- tsclust(df_list_z, type = "h", k = i, distance = "dtw",
                                control = hierarchical_control(method = "complete"),
                                seed = 390, preproc = NULL,
                                args = tsclust_args(dist = list(window.size = 5L)))
}

# take a look at the object
cluster_dtw_h[[20]]
## hierarchical clustering with 20 clusters
## Using dtw distance
## Using PAM (Hierarchical) centroids
## Using method complete 
## 
## Time required for analysis:
##    user  system elapsed 
##   0.998   0.065   1.065 
## 
## Cluster sizes with average intra-cluster distance:
## 
##    size av_dist
## 1     1       0
## 2     1       0
## 3     1       0
## 4     1       0
## 5     1       0
## 6     1       0
## 7     1       0
## 8     1       0
## 9     1       0
## 10    1       0
## 11    1       0
## 12    1       0
## 13    1       0
## 14    1       0
## 15    1       0
## 16    1       0
## 17    1       0
## 18    1       0
## 19    1       0
## 20    1       0

# some cluster information
cluster_dtw_h[[4]]@clusinfo
##   size  av_dist
## 1    6 33.11645
## 2    3 28.26739
## 3    2 22.15062
## 4    9 35.22409
# plot dendrogram for k= 4
plot(cluster_dtw_h[[4]])

# the series and the obtained prototypes can be plotted too
plot(cluster_dtw_h[[4]], type = "sc")

# the representative prototype 
plot(cluster_dtw_h[[4]], type = "centroid")


References

https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf

http://www.sthda.com/english/wiki/print.php?id=237

https://cran.r-project.org/web/packages/dtwclust/dtwclust.pdf

https://cran.r-project.org/web/packages/flexclust/flexclust.pdf

https://cran.r-project.org/web/packages/dtw/dtw.pdf

https://rdrr.io/cran/dtwclust/man/cvi.html

https://www.rdocumentation.org/packages/dtwclust/versions/3.1.1/topics/reinterpolate


Arbelaitz O, Gurrutxaga I, Muguerza J. An extensive comparative study of cluster validity indices. Pattern Recognition. 2013;46:243–256. doi:10.1016/j.patcog.2012.07.021.

Cuturi M. Fast Global Alignment Kernels. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011; p. 929–936.

Ratanamahatana CA, Keogh E. Everything you know about Dynamic Time Warping is Wrong. In: Proc. of KDD Workshop on Mining Temporal and Sequential Data. 2004.

Sarda-Espinosa A. Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package. 2017; p. 1–41.

Sarda-Espinosa A. dtwclust: Time Series Clustering Along with Optimizations for the Dynamic Time Warping Distance version 5.1.0. 2017.