Functional data clustering with R

stat17_hb @ korea.ac.kr

1 Intro

Functional data analysis (FDA) deals with data that “provides information about curves, surfaces or anything else varying over a continuum.” This task view catalogues available packages in this rapidly developing field.

FDA는 간단히 말해 데이터가 함수 형태인 경우에 대한 분석이다. functional data에 대한 클러스터링도 생각해 볼 수 있는데 CRAN에는 functional data clustering을 위한 아래의 4가지 팩키지가 올라와 있다. 이 중 funFEM, funHDDC, fdakma 패키지의 사용법에 대해 정리해보고자 한다.

https://cran.r-project.org/web/views/FunctionalData.html

1~3 번째 패키지는 Charles Bouveyron이 공통적으로 저자에 포함되어 있는데, 프랑스 Université Côte d’Azur의 통계학과 교수로 재직 중이고, Functional data clustering 관련 논문들을 다수 쓰신 분이다.


2 funFEM

2.1 Description

The funFEM algorithm allows to cluster time series or, more generally, functional data. It is based on a discriminative functional mixture model which allows the clustering of the data in a unique and discriminative functional subspace. This model presents the advantage to be parsimonious and can therefore handle long time series.

2.2 Usage

funFEM(fd, K=2:6, model = "AkjBk", crit = "bic", init = "hclust", Tinit = c(), maxit = 50,
       eps = 1e-06, disp = FALSE, lambda = 0, graph = FALSE)

2.3 Arguments

names description
fd a functional data object produced by the fda package.
K an integer vector specifying the numbers of mixture components (clusters) among which the model selection criterion will choose the most appropriate number of groups. Default is 2:6.
model a vector of discriminative latent mixture (DLM) models to fit. There are 12 different models: “DkBk”, “DkB”, “DBk”, “DB”, “AkjBk”, “AkjB”, “AkBk”, “AkBk”, “AjBk”, “AjB”, “ABk”, “AB”. The option “all” executes the funFEM algorithm on the 12 models and select the best model according to the maximum value obtained by model selection criterion.
crit the criterion to be used for model selection (‘bic’, ‘aic’ or ‘icl’). ‘bic’ is the default.
init the initialization type (‘random’, ‘kmeans’ of ‘hclust’). ‘hclust’ is the default.
Tinit a n x K matrix which contains posterior probabilities for initializing the algorithm (each line corresponds to an individual).
maxit the maximum number of iterations before the stop of the Fisher-EM algorithm.
eps the threshold value for the likelihood differences to stop the Fisher-EM algorithm.
disp if true, some messages are printed during the clustering. Default is false.
lambda the l0 penalty (between 0 and 1) for the sparse version. See (Bouveyron et al., 2014) for details. Default is 0.
graph if true, it plots the evolution of the log-likelhood. Default is false.

2.4 Example

다음은 R에서 제공하는 예제이다. Funtional Data Analysis 분야에서 유명한 교재인 Ramsay & Silverman에 나오는 Canadian temperature 데이터를 사용한다.

library(funFEM)

head(day.5)
jan01 jan02 jan03 jan04 jan05 jan06 
  0.5   1.5   2.5   3.5   4.5   5.5 
CanadianWeather_Temp <- CanadianWeather$dailyAv[,,"Temperature.C"]
head(CanadianWeather_Temp)
      St. Johns Halifax Sydney Yarmouth Charlottvl Fredericton Scheffervll
jan01      -3.6    -4.4   -3.8     -1.4       -5.8        -7.9       -22.5
jan02      -3.1    -4.2   -3.5     -1.6       -5.6        -7.5       -23.0
jan03      -3.4    -5.3   -4.6     -2.5       -7.3        -9.3       -23.0
jan04      -4.4    -5.4   -5.0     -2.3       -7.0        -8.7       -21.8
jan05      -2.9    -5.6   -4.1     -2.4       -6.7        -9.1       -23.5
jan06      -4.5    -7.1   -6.1     -3.7       -8.9       -10.9       -24.4
      Arvida Bagottville Quebec Sherbrooke Montreal Ottawa Toronto London
jan01  -14.1       -14.6  -10.8      -10.1     -8.7   -9.4    -5.4   -5.2
jan02  -14.4       -14.7  -11.4      -10.5     -9.9  -10.4    -5.4   -5.7
jan03  -15.0       -15.5  -12.5      -11.4     -9.8   -9.9    -5.0   -5.3
jan04  -14.3       -14.3  -11.2      -10.2     -9.0   -9.2    -5.9   -5.8
jan05  -16.2       -15.8  -12.0      -11.3     -9.8  -10.1    -5.7   -6.5
jan06  -16.3       -16.9  -13.2      -12.6    -10.7  -11.0    -5.6   -5.6
      Thunder Bay Winnipeg The Pas Churchill Regina Pr. Albert
jan01       -14.0    -17.9   -20.5     -24.9  -15.7      -19.3
jan02       -14.0    -16.9   -19.6     -25.0  -15.0      -18.0
jan03       -13.5    -17.7   -20.6     -26.5  -15.8      -18.5
jan04       -13.9    -18.5   -21.0     -26.7  -17.3      -19.4
jan05       -14.4    -18.2   -21.7     -26.7  -17.0      -20.3
jan06       -14.5    -18.4   -22.6     -28.5  -17.1      -20.3
      Uranium City Edmonton Calgary Kamloops Vancouver Victoria Pr. George
jan01        -23.5    -12.6    -8.6     -5.3       2.3      3.0      -10.5
jan02        -23.6    -12.7    -8.4     -5.6       2.1      2.7      -11.4
jan03        -24.6    -14.2    -9.2     -6.5       1.9      2.7      -11.0
jan04        -25.5    -14.8   -10.8     -6.8       2.0      2.9      -11.9
jan05        -26.5    -15.4   -12.0     -7.3       1.6      2.6      -13.5
jan06        -27.9    -15.1   -11.7     -7.9       1.4      2.1      -13.2
      Pr. Rupert Whitehorse Dawson Yellowknife Iqaluit Inuvik Resolute
jan01        0.4      -17.6  -28.0       -24.5   -23.3  -26.3    -30.7
jan02        0.5      -18.6  -28.9       -25.3   -24.0  -27.1    -30.6
jan03       -0.2      -18.6  -29.5       -26.1   -24.4  -27.8    -31.4
jan04       -0.6      -20.0  -29.0       -27.7   -24.7  -28.2    -31.9
jan05       -1.0      -20.9  -28.8       -27.8   -25.3  -27.2    -31.5
jan06       -0.6      -21.0  -29.9       -28.1   -26.3  -27.6    -31.1
dim(CanadianWeather_Temp)
[1] 365  35
  • day.5에는 365일이 0.5부터 364.5까지의 값으로 들어가 있는데 day.5는 365일에 대한 그리드의 역할을 한다.

  • CanadianWeather 데이터의 dailyAV(일평균) 중 Temperature.C(섭씨온도)를 분석에 사용한다. 캐나다에 있는 35개 도시에 대한 데이터이다.

# Clustering the well-known "Canadian temperature" data (Ramsay & Silverman)
basis <- create.bspline.basis(c(0, 365), nbasis=21, norder=4) # norder=4 : cubic spline
fdobj <- smooth.basis(day.5, CanadianWeather_Temp, 
                      basis, fdnames=list("Day", "Station", "Deg C"))$fd
res <- funFEM(fdobj, K=4)
  • 분석은 basis 생성 -> smoothing -> 클러스터링의 순서로 이루어진다.

  • create.bspline.basis() 함수와 smooth.basis() 함수는 fda 팩키지에 포함되어 있다.

create.bspline.basis() 함수에 대한 설명은 아래와 같다.

Functional data objects are constructed by specifying a set of basis functions and a set of coefficients defining a linear combination of these basis functions. The B-spline basis is used for non-periodic functions. B-spline basis functions are polynomial segments jointed end-to-end at at argument values called knots, breaks or join points. The segments have specifiable smoothness across these breaks. B-splne basis functions have the advantages of very fast computation and great flexibility.

B-spline이 무엇인지 위키에 검색해보면

B-spline

In the mathematical subfield of numerical analysis, a B-spline, or basis spline, is a spline function that has minimal support with respect to a given degree, smoothness, and domain partition. Any spline function of given degree can be expressed as a linear combination of B-splines of that degree. Cardinal B-splines have knots that are equidistant from each other. B-splines can be used for curve-fitting and numerical differentiation of experimental data.

B는 Basis의 약자이고, 스플라인은

스플라인 곡선은 주어진 복수의 제어점을 통과하는 부드러운 곡선으로, 인접한 두 점 사이에의 구간마다 별도의 다항식을 이용해 곡선을 정의한다.

고 하는데 그림을 보면 대강 무슨 말인지 이해할 수 있다.

funFEM() 함수를 통해 리턴되는 값은 다음과 같다.

name description
model the model name.
K the number of groups.
cls the group membership of each individual estimated by the Fisher-EM algorithm.
P the posterior probabilities of each individual for each group.
prms the model parameters.
U the orientation of the functional subspace according to the basis functions.
aic the value of the Akaike information criterion.
bic the value of the Bayesian information criterion.
icl the value of the integrated completed likelihood criterion.
loglik the log-likelihood values computed at each iteration of the FEM algorithm.
ll the log-likelihood value obtained at the last iteration of the FEM algorithm.
nbprm the number of free parameters in the model.
call the call of the function.
plot some information to pass to the plot.fem function.
crit the model selction criterion used.
# Visualization of the partition and the group means
par(mfrow=c(1,2))
plot(fdobj, col=res$cls, lwd=2, lty=1)
[1] "done"
fdmeans <- fdobj
fdmeans$coefs <- t(res$prms$my)
plot(fdmeans, col=1:max(res$cls), lwd=2)

[1] "done"
str(fdmeans)
List of 3
 $ coefs  : num [1:21, 1:4] -6.17 -10.48 -6.83 -8.69 -4.34 ...
 $ basis  :List of 10
  ..$ call       : language basisfd(type = type, rangeval = rangeval, nbasis = nbasis, params = params,      dropind = dropind, quadvals = qu| __truncated__
  ..$ type       : chr "bspline"
  ..$ rangeval   : num [1:2] 0 365
  ..$ nbasis     : num 21
  ..$ params     : num [1:17] 20.3 40.6 60.8 81.1 101.4 ...
  ..$ dropind    : NULL
  ..$ quadvals   : NULL
  ..$ values     : list()
  ..$ basisvalues: list()
  ..$ names      : chr [1:21] "bspl4.1" "bspl4.2" "bspl4.3" "bspl4.4" ...
  ..- attr(*, "class")= chr "basisfd"
 $ fdnames:List of 3
  ..$ : chr "Day"
  ..$ : chr "Station"
  ..$ : chr "Deg C"
 - attr(*, "class")= chr "fd"
  • fdmeans에 coefs로 들어가는 res$prms$my 는 nbasis(basis 수) x K(클러스터 수)의 차원으로 되어 있는데 클러스터 마다 basis에 해당하는 계수 값들이 있어야 그래프를 그릴 수 있는 것 같다.

  • 클러스터링 결과는 fda 팩키지의 plot.fd() 함수로 시각화 할 수 있다. 결과를 보면 35개의 곡선이 4개의 클러스터로 나누어지는 것을 볼 수 있다.

2.5 Summary

  • B-spline을 이용한 functional data clustering

  • Reference를 보면 multivariate case도 가능한 것 같은데 Help에 이에 관한 설명이 없음


3 funHDDC

3.1 Description

It provides the funHDDC algorithm which allows to cluster functional univariate and multivariate data by modeling each cluster within a specific functional subspace. See (Schmutz et al, 2018) for details on the parameters described below.

3.2 Usage

funHDDC(data, K=1:10, model="AkjBkQkDk", threshold=0.2, itermax=200, eps=1e-6, init="kmeans",
        criterion="bic", algo='EM', d_select="Cattell", init.vector=NULL, show=TRUE,
        mini.nb=c(5, 10), min.individuals=2, mc.cores=1, nb.rep=1, keepAllRes=TRUE,
        kmeans.control = list(), d_max=100)

3.3 Arguments

name description
data In the univariate case: a functional data object produced by the fda package, in the multivariate case: a list of functional data objects.
K The number of clusters, you can test one partition or multiple partitions at the same time, for example K=2:10.
model The chosen model among ‘AkjBkQkDk’, ‘AkjBQkDk’, ‘AkBkQkDk’, ‘ABkQkDk’, ‘AkBQkDk’, ‘ABQkDk’. ‘AkjBkQkDk’ is the default. You can test multiple models at the same time with the command c(). For example c(“AkjBkQkDk”,“AkjBQkDk”).
threshold The threshold of the Cattell’ scree-test used for selecting the group-specific intrinsic dimensions.
itermax The maximum number of iterations.
eps The threshold of the convergence criterion.
init A character string. It is the way to initialize the E-M algorithm. There are five ways of initialization: “kmeans” (default), “param”, “random”, “mini-em” or “vector”. See details for more information. It can also be directly initialized with a vector containing the prior classes of the observations.
criterion The criterion used for model selection: bic (default) or icl. But we recommand using slopeHeuristic function provided in this package.
algo A character string indicating the algorithm to be used. The available algorithms are the Expectation-Maximisation (“EM”), the Classification E-M (“CEM”) and the Stochastic E-M (“SEM”). The default algorithm is the “EM”.
d_select Either “Cattell” (default) or “BIC”. This parameter selects which method to use to select the intrinsic dimensions of subgroups.
init.vector A vector of integers or factors. It is a user-given initialization. It should be of the same length as of the data. Only used when init=“vector”.
show Use show = FALSE to settle off the informations that may be printed.
mini.nb A vector of integers of length two. This parameter is used in the “mini-em” initialization. The first integer sets how many times the algorithm is repeated; the second sets the maximum number of iterations the algorithm will do each time. For example, if init=“mini-em” and mini.nb=c(5,10), the algorithm wil be launched 5 times, doing each time 10 iterations; finally the algorithm will begin with the initialization that maximizes the log-likelihood.
min.individuals This parameter is used to control for the minimum population of a class. If the population of a class becomes stricly inferior to ‘min.individuals’ then the algorithm stops and gives the message: ‘pop<min.indiv.’. Here the meaning of “population of a class” is the sum of its posterior probabilities. The value of ‘min.individuals’ cannot be lower than 2.
mc.cores Positive integer, default is 1. If mc.cores>1, then parallel computing is used, using mc.cores cores. Warning for Windows users only: the parallel computing can sometimes be slower than using one single core (due to how parLapply works).
nb.rep A positive integer (default is 1 for kmeans initialization and 20 for random initialization). Each estimation (i.e. combination of (model, K, threshold)) is repeated nb.rep times and only the estimation with the highest log-likelihood is kept.
keepAllRes Logical. Should the results of all runs be kept? If so, an argument all_results is created in the results. Default is TRUE.
kmeans.control A list. The elements of this list should match the parameters of the kmeans initialization (see kmeans help for details). The parameters are “iter.max”, “nstart” and “algorithm”.
d_max A positive integer. The maximum number of dimensions to be computed. Default is 100. It means that the instrinsic dimension of any cluster cannot be larger than d_max. It quickens a lot the algorithm for datasets with a large number of variables (e.g. thousands).

3.4 Example

3.4.1 Univariate example

library(funHDDC)
data("trigo")
dim(trigo)
[1] 100 101
basis <- create.bspline.basis(c(0,1), nbasis=25)
var1 <- smooth.basis(argvals=seq(0,1,length.out = 100),y=t(trigo[,1:100]),fdParobj=basis)$fd

res.uni <- funHDDC(var1,K=2,model="AkBkQkDk",init="kmeans",threshold=0.2)
     model K threshold complexity         BIC
1 AKBKQKDK 2       0.2        103 -454,549.94

SELECTED: model  AKBKQKDK  with  2  clusters.
Selection Criterion: BIC.
plot(var1,col=res.uni$class)

[1] "done"

3.4.2 Multivariate example

data("triangle")
basis <- create.bspline.basis(c(1,21), nbasis=25)
var1 <- smooth.basis(argvals=seq(1,21, length.out = 101), y=t(triangle[,1:101]), fdParobj=basis)$fd
var2 <- smooth.basis(argvals=seq(1,21, length.out = 101), y=t(triangle[,102:202]), fdParobj=basis)$fd

res.multi <- funHDDC(list(var1,var2), K=3,model="AkjBkQkDk",init="kmeans",threshold=0.2)
      model K threshold complexity        BIC
1 AKJBKQKDK 3       0.2        736 -25,384.82

SELECTED: model  AKJBKQKDK  with  3  clusters.
Selection Criterion: BIC.
par(mfrow=c(1,2))
plot(var1, col=res.multi$class)
[1] "done"
plot(var2, col=res.multi$class)

[1] "done"

3.4.3 Clustering the “Canadian temperature” data (Ramsey & Silverman): univariate case

funFEM() 함수의 예제와 비교하기 위해 create.bspline.basis() 함수를 사용하려고 하였는데 동일한 세팅으로 코드를 돌리니 모델이 수렴하지 않았다.

basis_temp <- create.bspline.basis(c(0, 365), nbasis=21, norder=4) # norder=4 : cubic spline
daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis_temp,
                          fdnames=list("Day", "Station", "Deg C"))$fd
res.uni <- funHDDC(daytempfd, K=4)
Warning in funHDDC(daytempfd, K = 4): All models diverged.

Help(funHDDC)의 Examples와 같이 create.fourier.basis() 함수로 basis를 만들어서 사용하였다. 클러스터 개수는 funFEM() 함수의 Examples와 같이 4로 맞추어 비교하였다.

basis_temp <- create.fourier.basis(c(0, 365), nbasis=21, period=365) 
daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis_temp,
                          fdnames=list("Day", "Station", "Deg C"))$fd
res.uni <- funHDDC(daytempfd, K=4)
      model K threshold complexity       BIC
1 AKJBKQKDK 4       0.2        274 -4,648.89

SELECTED: model  AKJBKQKDK  with  4  clusters.
Selection Criterion: BIC.
## Graphical representation of groups mean curves
par(mfrow=c(1,2))
plot(daytempfd, col=res.uni$class, lwd=2, lty=1)
[1] "done"
select1 <- fd(daytempfd$coefs[,which(res.uni$class==1)], daytempfd$basis)
select2 <- fd(daytempfd$coefs[,which(res.uni$class==2)], daytempfd$basis)
select3 <- fd(daytempfd$coefs[,which(res.uni$class==3)], daytempfd$basis)
select4 <- fd(daytempfd$coefs[,which(res.uni$class==4)], daytempfd$basis)

plot(mean.fd(select1),col=1, ylim=c(-20,20), lty=1, lwd=3)
[1] "done"
lines(mean.fd(select2),col=2, lty=1,lwd=3)
lines(mean.fd(select3),col=3, lty=1,lwd=3)
lines(mean.fd(select4),col=4, lty=1,lwd=3)

3.4.4 Clustering the “Canadian temperature” data (Ramsey & Silverman): multivariate case

basis <- create.fourier.basis(c(0, 365), nbasis=21, period=365) 

daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis,
                          fdnames=list("Day", "Station", "Deg C"))$fd

dayprecfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Precipitation.mm"], basis,
                        fdnames=list("Day", "Station", "Mm"))$fd

res.multi <- funHDDC(list(daytempfd,dayprecfd), K=4)
      model K threshold complexity       BIC
1 AKJBKQKDK 4       0.2        505 -8,189.95

SELECTED: model  AKJBKQKDK  with  4  clusters.
Selection Criterion: BIC.
## Graphical representation of groups mean curves
par(mfrow=c(2,2))

## Temperature
plot(daytempfd, col=res.multi$class, lwd=2, lty=1)
[1] "done"
select1 <- fd(daytempfd$coefs[,which(res.multi$class==1)], daytempfd$basis)
select2 <- fd(daytempfd$coefs[,which(res.multi$class==2)], daytempfd$basis)
select3 <- fd(daytempfd$coefs[,which(res.multi$class==3)], daytempfd$basis)
select4 <- fd(daytempfd$coefs[,which(res.multi$class==4)], daytempfd$basis)

plot(mean.fd(select1),col=1, ylim=c(-20,20), lty=1, lwd=3)
[1] "done"
lines(mean.fd(select2),col=2, lty=1,lwd=3)
lines(mean.fd(select3),col=3, lty=1,lwd=3)
lines(mean.fd(select4),col=4, lty=1,lwd=3)

## Precipitation
plot(dayprecfd, col=res.multi$class, lwd=2, lty=1)
[1] "done"
select1b <- fd(dayprecfd$coefs[,which(res.multi$class==1)], dayprecfd$basis)
select2b <- fd(dayprecfd$coefs[,which(res.multi$class==2)], dayprecfd$basis)
select3b <- fd(dayprecfd$coefs[,which(res.multi$class==3)], dayprecfd$basis)
select4b <- fd(dayprecfd$coefs[,which(res.multi$class==4)], dayprecfd$basis)

plot(mean.fd(select1b),col=1, ylim=c(-20,20), lty=1, lwd=3)
[1] "done"
lines(mean.fd(select2b),col=2, lty=1,lwd=3)
lines(mean.fd(select3b),col=3, lty=1,lwd=3)
lines(mean.fd(select4b),col=4, lty=1,lwd=3)

3.5 Summary

  • multivariate case에 대한 example도 제공됨

4 fdakma

4.1 Description

kma jointly performs clustering and alignment of a functional dataset (multidimensional or unidimensional functions). To run kma function with different numbers of clusters and/or different alignment methods see kma.compare.

4.2 Usage

kma(x, y0 = NULL, y1 = NULL, n.clust = 1, warping.method = "affine",
    similarity.method = "d1.pearson", center.method = "k-means", seeds = NULL,
    optim.method = "L-BFGS-B", span = 0.15, t.max = 0.1, m.max = 0.1, n.out = NULL,
    tol = 0.01, fence = TRUE, iter.max = 100, show.iter = 0, nstart=2, return.all=FALSE,
    check.total.similarity=FALSE)

4.3 Arguments

name description
x matrix n.func X grid.size or vector grid.size: the abscissa values where each function is evaluated. n.func: number of functions in the dataset. grid.size: maximal number of abscissa values where each function is evaluated. The abscissa points may be unevenly spaced and they may differ from function to function. x can also be a vector of length grid.size. In this case, x will be used as abscissa grid for all functions.
y0 matrix n.func X grid.size or array n.func X grid.size X d: evaluations of the set of original functions on the abscissa grid x. n.func: number of functions in the dataset. grid.size: maximal number of abscissa values where each function is evaluated. d: (only if the sample is multidimensional) number of function components, i.e. each function is a d-dimensional curve. Default value of y0 is NULL. The parameter y0 must be provided if the chosen similarity.method concerns original functions.
y1 matrix n.func X grid.size or array n.func X grid.size X d: evaluations of the set of original functions first derivatives on the abscissa grid x. Default value of y1 is NULL. The parameter y1 must be provided if the chosen similarity.method concerns original function first derivatives.
n.clust scalar: required number of clusters. Default value is 1. Note that if n.clust=1 kma performs only alignment without clustering.
warping.method character: type of alignment required. If warping.method=‘NOalignment’ kma performs only k-mean clustering (without alignment). If warping.method=‘affine’ kma performs alignment (and possibly clustering) of functions using linear affine transformation as warping functions, i.e., x.final = dilationx + shift. If warping.method=‘shift’ kma allows only shift, i.e., x.final = x + shift. If warping.method=‘dilation’ kma allows only dilation, i.e., x.final = dilationx. Default value is ‘affine’.
similarity.method character: required similarity measure. Possible choices are: ‘d0.pearson’, ‘d1.pearson’, ‘d0.L2’, ‘d1.L2’, ‘d0.L2.centered’, ‘d1.L2.centered’. Default value is ‘d1.pearson’. See kma.similarity for details.
center.method character: type of clustering method to be used. Possible choices are: ‘k-means’ and ‘k-medoids’. Default value is ‘k-means’.
seeds vector max(n.clust) or matrix nstart X n.clust: indexes of the functions to be used as initial centers. If it is a matrix, each row contains the indexes of the initial centers of one of the nstart initializations. In the case where not all the values of seeds are provided, those not provided are randomly chosen among the n.func original functions. If seeds=NULL all the centers are randomly chosen. Default value of seeds is NULL
optim.method character: optimization method chosen to find the best warping functions at each iteration. Possible choices are: ‘L-BFGS-B’ and ‘SANN’. See optim function for details. Default method is ‘L-BFGS-B’.
span scalar: the span to be used for the loess procedure in the center estimation step when center.method=‘k-means’. Default value is 0.15. If center.method=‘k-medoids’ value of span is ignored.
t.max scalar: t.max controls the maximal allowed shift, at each iteration, in the alignment procedure with respect to the range of curve domains. t.max must be such that 0<t.max<1 (e.g., t.max=0.1 means that shift is bounded, at each iteration, between -0.1*range(x) and +0.1*range(x)). Default value is 0.1. If warping.method=‘dilation’ value of t.max is ignored.
m.max scalar: m.max controls the maximal allowed dilation, at each iteration, in the alignment procedure. m.max must be such that 0<m.max<1 (e.g., m.max=0.1 means that dilation is bounded, at each iteration, between 1-0.1 and 1+0.1 ). Default value is 0.1. If warping.method=‘shift’ value of m.max is ignored.
n.out scalar: the desired length of the abscissa for computation of the similarity indexes and the centers. Default value is round(1.1*grid.size).
tol scalar: the algorithm stops when the increment of similarity of each function with respect to the corrispondent center is lower than tol. Default value is 0.01.
fence boolean: if fence=TRUE a control is activated at the end of each iteration. The aim of the control is to avoid shift/dilation outlighers with respect to their computed distributions. If fence=TRUE the running time can increase considerably. Default value of fence is TRUE.
iter.max scalar: maximum number of iterations in the k-mean alignment cycle. Default value is 100.
show.iter boolean: if show.iter=TRUE kma shows the current iteration of the algorithm. Default value is FALSE.
nstart scalar: number of initializations with different seeds. Default value is 2. This parameter is used only if center.method is ‘k-medoids’. When center.method = ‘k-means’ one initialization is performed.
return.all boolean: if return.all=TRUE the results of all the nstart initializations are returned; the output is a list of length nstart. If return.all=FALSE only the best result is provided (the one with higher mean similarity if similarity.method is ‘d0.pearson’ or’d1.pearson’, or the one with lower distance if similarity.method is ‘d0.L2’, ‘d1.L2’, ‘d0.L2.centered’ or ‘d1.L2.centered’). Default value is FALSE.
check.total.similarity boolean: if check.total.similarity=TRUE at each iteration the algorithm checks if there is a decrease of the total similarity and stops. In the affermative case the result obtained in the penultimate iteration is returned. Default value is FALSE

4.4 Example

library(fdakma)

data(kma.data)
x <- kma.data$x # abscissas
y0 <- kma.data$y0 # evaluations of original functions
y1 <- kma.data$y1 # evaluations of original function first derivatives

# Plot of original functions
matplot(t(x),t(y0), type='l', xlab='x', ylab='orig.func')
title ('Original functions')

# Plot of original function first derivatives
matplot(t(x),t(y1), type='l', xlab='x', ylab='orig.deriv')
title ('Original function first derivatives')

# Example: result of kma function with 2 clusters, allowing affine transformation for the abscissas and considering 'd1.pearson' as similarity.method.
fdakma_example <- kma (
  x=x, y0=y0, y1=y1, n.clust = 2,
  warping.method = 'affine',
  similarity.method = 'd1.pearson',
  center.method = 'k-means',
  seeds = c(1,21)
)
kma.show.results(fdakma_example)

names(fdakma_example)
 [1] "iterations"        "x"                 "y0"               
 [4] "y1"                "n.clust"           "warping.method"   
 [7] "similarity.method" "center.method"     "x.center.orig"    
[10] "y0.center.orig"    "y1.center.orig"    "similarity.orig"  
[13] "x.final"           "n.clust.final"     "x.centers.final"  
[16] "y0.centers.final"  "y1.centers.final"  "labels"           
[19] "similarity.final"  "dilation.list"     "shift.list"       
[22] "dilation"          "shift"            
# Labels assigned to each function
fdakma_example$labels
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
# Total shifts and dilations applied to the original abscissa to obtain the aligned abscissa
fdakma_example$shift
 [1]  0.338952447  0.319586287  0.368394585  0.227746184  0.260489905
 [6]  0.302015457  0.316415039  0.380486215  0.308146261  0.255387772
[11] -0.303649497 -0.197017682 -0.259354606 -0.458456588 -0.284497204
[16] -0.437181167 -0.313016144 -0.406582174 -0.117449217 -0.300415873
[21]  0.067078440 -0.064369102  0.035742555 -0.033695095  0.042297283
[26]  0.053293202 -0.003068418 -0.004865682 -0.036760546 -0.055652637
fdakma_example$dilation
 [1] 1.0761533 1.1325796 1.0793233 1.1129867 1.0696233 1.1822717 1.2043665
 [8] 1.0838001 1.0295260 1.0353835 0.9409293 0.8306430 0.9158617 0.8306430
[15] 0.9096607 0.9899872 0.9293422 0.8814535 0.8811105 0.8843551 0.9040758
[22] 1.0357245 0.9307015 0.9214329 1.0610187 1.0327258 1.0334928 1.0353821
[29] 0.9792999 1.0661460
fdakma_example$x.center.orig
  [1] 0.00000000 0.02869034 0.05738069 0.08607103 0.11476138 0.14345172
  [7] 0.17214206 0.20083241 0.22952275 0.25821309 0.28690344 0.31559378
 [13] 0.34428413 0.37297447 0.40166481 0.43035516 0.45904550 0.48773585
 [19] 0.51642619 0.54511653 0.57380688 0.60249722 0.63118757 0.65987791
 [25] 0.68856825 0.71725860 0.74594894 0.77463928 0.80332963 0.83201997
 [31] 0.86071032 0.88940066 0.91809100 0.94678135 0.97547169 1.00416204
 [37] 1.03285238 1.06154272 1.09023307 1.11892341 1.14761375 1.17630410
 [43] 1.20499444 1.23368479 1.26237513 1.29106547 1.31975582 1.34844616
 [49] 1.37713651 1.40582685 1.43451719 1.46320754 1.49189788 1.52058823
 [55] 1.54927857 1.57796891 1.60665926 1.63534960 1.66403994 1.69273029
 [61] 1.72142063 1.75011098 1.77880132 1.80749166 1.83618201 1.86487235
 [67] 1.89356270 1.92225304 1.95094338 1.97963373 2.00832407 2.03701441
 [73] 2.06570476 2.09439510 2.12308545 2.15177579 2.18046613 2.20915648
 [79] 2.23784682 2.26653717 2.29522751 2.32391785 2.35260820 2.38129854
 [85] 2.40998888 2.43867923 2.46736957 2.49605992 2.52475026 2.55344060
 [91] 2.58213095 2.61082129 2.63951164 2.66820198 2.69689232 2.72558267
 [97] 2.75427301 2.78296336 2.81165370 2.84034404 2.86903439 2.89772473
[103] 2.92641507 2.95510542 2.98379576 3.01248611 3.04117645 3.06986679
[109] 3.09855714 3.12724748 3.15593783 3.18462817 3.21331851 3.24200886
[115] 3.27069920 3.29938954 3.32807989 3.35677023 3.38546058 3.41415092
[121] 3.44284126 3.47153161 3.50022195 3.52891230 3.55760264 3.58629298
[127] 3.61498333 3.64367367 3.67236402 3.70105436 3.72974470 3.75843505
[133] 3.78712539 3.81581573 3.84450608 3.87319642 3.90188677 3.93057711
[139] 3.95926745 3.98795780 4.01664814 4.04533849 4.07402883 4.10271917
[145] 4.13140952 4.16009986 4.18879020 4.21748055 4.24617089 4.27486124
[151] 4.30355158 4.33224192 4.36093227 4.38962261 4.41831296 4.44700330
[157] 4.47569364 4.50438399 4.53307433 4.56176468 4.59045502 4.61914536
[163] 4.64783571 4.67652605 4.70521639 4.73390674 4.76259708 4.79128743
[169] 4.81997777 4.84866811 4.87735846 4.90604880 4.93473915 4.96342949
[175] 4.99211983 5.02081018 5.04950052 5.07819086 5.10688121 5.13557155
[181] 5.16426190 5.19295224 5.22164258 5.25033293 5.27902327 5.30771362
[187] 5.33640396 5.36509430 5.39378465 5.42247499 5.45116533 5.47985568
[193] 5.50854602 5.53723637 5.56592671 5.59461705 5.62330740 5.65199774
[199] 5.68068809 5.70937843 5.73806877 5.76675912 5.79544946 5.82413981
[205] 5.85283015 5.88152049 5.91021084 5.93890118 5.96759152 5.99628187
[211] 6.02497221 6.05366256 6.08235290 6.11104324 6.13973359 6.16842393
[217] 6.19711428 6.22580462 6.25449496 6.28318531
fdakma_example$x.centers.final
  [1] -0.458456588 -0.420364671 -0.382272754 -0.344180837 -0.306088920
  [6] -0.267997004 -0.229905087 -0.191813170 -0.153721253 -0.115629336
 [11] -0.077537419 -0.039445503 -0.001353586  0.036738331  0.074830248
 [16]  0.112922165  0.151014082  0.189105999  0.227197915  0.265289832
 [21]  0.303381749  0.341473666  0.379565583  0.417657500  0.455749416
 [26]  0.493841333  0.531933250  0.570025167  0.608117084  0.646209001
 [31]  0.684300917  0.722392834  0.760484751  0.798576668  0.836668585
 [36]  0.874760502  0.912852418  0.950944335  0.989036252  1.027128169
 [41]  1.065220086  1.103312003  1.141403919  1.179495836  1.217587753
 [46]  1.255679670  1.293771587  1.331863504  1.369955420  1.408047337
 [51]  1.446139254  1.484231171  1.522323088  1.560415005  1.598506922
 [56]  1.636598838  1.674690755  1.712782672  1.750874589  1.788966506
 [61]  1.827058423  1.865150339  1.903242256  1.941334173  1.979426090
 [66]  2.017518007  2.055609924  2.093701840  2.131793757  2.169885674
 [71]  2.207977591  2.246069508  2.284161425  2.322253341  2.360345258
 [76]  2.398437175  2.436529092  2.474621009  2.512712926  2.550804842
 [81]  2.588896759  2.626988676  2.665080593  2.703172510  2.741264427
 [86]  2.779356343  2.817448260  2.855540177  2.893632094  2.931724011
 [91]  2.969815928  3.007907845  3.045999761  3.084091678  3.122183595
 [96]  3.160275512  3.198367429  3.236459346  3.274551262  3.312643179
[101]  3.350735096  3.388827013  3.426918930  3.465010847  3.503102763
[106]  3.541194680  3.579286597  3.617378514  3.655470431  3.693562348
[111]  3.731654264  3.769746181  3.807838098  3.845930015  3.884021932
[116]  3.922113849  3.960205765  3.998297682  4.036389599  4.074481516
[121]  4.112573433  4.150665350  4.188757266  4.226849183  4.264941100
[126]  4.303033017  4.341124934  4.379216851  4.417308768  4.455400684
[131]  4.493492601  4.531584518  4.569676435  4.607768352  4.645860269
[136]  4.683952185  4.722044102  4.760136019  4.798227936  4.836319853
[141]  4.874411770  4.912503686  4.950595603  4.988687520  5.026779437
[146]  5.064871354  5.102963271  5.141055187  5.179147104  5.217239021
[151]  5.255330938  5.293422855  5.331514772  5.369606688  5.407698605
[156]  5.445790522  5.483882439  5.521974356  5.560066273  5.598158189
[161]  5.636250106  5.674342023  5.712433940  5.750525857  5.788617774
[166]  5.826709691  5.864801607  5.902893524  5.940985441  5.979077358
[171]  6.017169275  6.055261192  6.093353108  6.131445025  6.169536942
[176]  6.207628859  6.245720776  6.283812693  6.321904609  6.359996526
[181]  6.398088443  6.436180360  6.474272277  6.512364194  6.550456110
[186]  6.588548027  6.626639944  6.664731861  6.702823778  6.740915695
[191]  6.779007611  6.817099528  6.855191445  6.893283362  6.931375279
[196]  6.969467196  7.007559112  7.045651029  7.083742946  7.121834863
[201]  7.159926780  7.198018697  7.236110614  7.274202530  7.312294447
[206]  7.350386364  7.388478281  7.426570198  7.464662115  7.502754031
[211]  7.540845948  7.578937865  7.617029782  7.655121699  7.693213616
[216]  7.731305532  7.769397449  7.807489366  7.845581283  7.883673200

4.4.1 Using funHDDC() function

dim(y0)
[1]  30 200
basis <- create.bspline.basis(c(0,8), nbasis=20)
var1 <- smooth.basis(argvals=seq(0,8, length.out = 200), y=t(y0), fdParobj=basis)$fd

res.uni <- funHDDC(var1, K=2, model="AkBkQkDk", init="kmeans", threshold=0.2)
     model K threshold complexity        BIC
1 AKBKQKDK 2       0.2        101 -12,536.10

SELECTED: model  AKBKQKDK  with  2  clusters.
Selection Criterion: BIC.
plot(var1, col=res.uni$class)

[1] "done"
cbind(fdakma=fdakma_example$labels, funHDDC=res.uni$class)
      fdakma funHDDC
 [1,]      1       1
 [2,]      1       1
 [3,]      1       1
 [4,]      1       1
 [5,]      1       1
 [6,]      1       1
 [7,]      1       1
 [8,]      1       1
 [9,]      1       1
[10,]      1       1
[11,]      1       2
[12,]      1       1
[13,]      1       2
[14,]      1       1
[15,]      1       2
[16,]      1       1
[17,]      1       2
[18,]      1       1
[19,]      1       2
[20,]      1       1
[21,]      2       1
[22,]      2       1
[23,]      2       1
[24,]      2       1
[25,]      2       1
[26,]      2       1
[27,]      2       1
[28,]      2       1
[29,]      2       1
[30,]      2       1
  • 두 방법의 결과가 매우 다르다.

4.5 Summary

  • x gird, original data, first derivative가 필요

  • 다변량인 경우도 지원한다고 되어 있는데 x가 변수마다 다른 경우는 지원하지 않음