1 Intro
Functional data analysis (FDA) deals with data that “provides information about curves, surfaces or anything else varying over a continuum.” This task view catalogues available packages in this rapidly developing field.
FDA는 간단히 말해 데이터가 함수 형태인 경우에 대한 분석이다. functional data에 대한 클러스터링도 생각해 볼 수 있는데 CRAN에는 functional data clustering을 위한 아래의 4가지 팩키지가 올라와 있다. 이 중 funFEM, funHDDC, fdakma 패키지의 사용법에 대해 정리해보고자 한다.
https://cran.r-project.org/web/views/FunctionalData.html
funFEM’s algorithm (Bouveyron et al., 2014) allows to cluster functional data by modeling the curves within a common and discriminative functional subspace.- Author: Charles Bouveyron
- Published: 2015-03-13
- Reference: C. Bouveyron, E. Côme and J. Jacques, The discriminative functional mixture model for the analysis of bike sharing systems, Preprint HAL n.01024186, University Paris Descartes, 2014.
funHDDCprovides the funHDDC algorithm (Bouveyron & Jacques, 2011) which allows to cluster functional data by modeling each group within a specific functional subspace.- Author: A Schmutz, J. Jacques & C. Bouveyron
- Published: 2018-10-08
- Regerence: Clustering multivariate functional data in group-specific functional subspaces
funLBMimplements model-based co-clustering of functional data, i.e., simultaneously clustering the rows and the columns of a data matrix where each entry of the matrix is a function or a time series.- Author: Charles Bouveyron and Julien Jacques
- Published: 2018-07-27
- Reference: C. Bouveyron, L. Bozzi, J. Jacques and F.-X. Jollois, The Functional Latent Block Model for the Co-Clustering of Electricity Consumption Curves, Journal of the Royal Statistical Society, Series C, 2018
fdakmaperforms clustering and alignment of a multidimensional or unidimensional functional dataset by means of k-mean alignment.- Author: Alice Parodi, Mirco Patriarca, Laura Sangalli, Piercesare Secchi, Simone Vantini, Valeria Vitelli
- Published: 2015-05-29
- Reference: Sangalli, L.M., Secchi, P., Vantini, S., Vitelli, V., 2010. “K-mean alignment for curve clustering”. Computational Statistics and Data Analysis, 54, 1219-1233.
1~3 번째 패키지는 Charles Bouveyron이 공통적으로 저자에 포함되어 있는데, 프랑스 Université Côte d’Azur의 통계학과 교수로 재직 중이고, Functional data clustering 관련 논문들을 다수 쓰신 분이다.
2 funFEM
2.1 Description
The funFEM algorithm allows to cluster time series or, more generally, functional data. It is based on a discriminative functional mixture model which allows the clustering of the data in a unique and discriminative functional subspace. This model presents the advantage to be parsimonious and can therefore handle long time series.
2.2 Usage
funFEM(fd, K=2:6, model = "AkjBk", crit = "bic", init = "hclust", Tinit = c(), maxit = 50,
eps = 1e-06, disp = FALSE, lambda = 0, graph = FALSE)2.3 Arguments
| names | description |
|---|---|
| fd | a functional data object produced by the fda package. |
| K | an integer vector specifying the numbers of mixture components (clusters) among which the model selection criterion will choose the most appropriate number of groups. Default is 2:6. |
| model | a vector of discriminative latent mixture (DLM) models to fit. There are 12 different models: “DkBk”, “DkB”, “DBk”, “DB”, “AkjBk”, “AkjB”, “AkBk”, “AkBk”, “AjBk”, “AjB”, “ABk”, “AB”. The option “all” executes the funFEM algorithm on the 12 models and select the best model according to the maximum value obtained by model selection criterion. |
| crit | the criterion to be used for model selection (‘bic’, ‘aic’ or ‘icl’). ‘bic’ is the default. |
| init | the initialization type (‘random’, ‘kmeans’ of ‘hclust’). ‘hclust’ is the default. |
| Tinit | a n x K matrix which contains posterior probabilities for initializing the algorithm (each line corresponds to an individual). |
| maxit | the maximum number of iterations before the stop of the Fisher-EM algorithm. |
| eps | the threshold value for the likelihood differences to stop the Fisher-EM algorithm. |
| disp | if true, some messages are printed during the clustering. Default is false. |
| lambda | the l0 penalty (between 0 and 1) for the sparse version. See (Bouveyron et al., 2014) for details. Default is 0. |
| graph | if true, it plots the evolution of the log-likelhood. Default is false. |
2.4 Example
다음은 R에서 제공하는 예제이다. Funtional Data Analysis 분야에서 유명한 교재인 Ramsay & Silverman에 나오는 Canadian temperature 데이터를 사용한다.
library(funFEM)
head(day.5)jan01 jan02 jan03 jan04 jan05 jan06
0.5 1.5 2.5 3.5 4.5 5.5
CanadianWeather_Temp <- CanadianWeather$dailyAv[,,"Temperature.C"]
head(CanadianWeather_Temp) St. Johns Halifax Sydney Yarmouth Charlottvl Fredericton Scheffervll
jan01 -3.6 -4.4 -3.8 -1.4 -5.8 -7.9 -22.5
jan02 -3.1 -4.2 -3.5 -1.6 -5.6 -7.5 -23.0
jan03 -3.4 -5.3 -4.6 -2.5 -7.3 -9.3 -23.0
jan04 -4.4 -5.4 -5.0 -2.3 -7.0 -8.7 -21.8
jan05 -2.9 -5.6 -4.1 -2.4 -6.7 -9.1 -23.5
jan06 -4.5 -7.1 -6.1 -3.7 -8.9 -10.9 -24.4
Arvida Bagottville Quebec Sherbrooke Montreal Ottawa Toronto London
jan01 -14.1 -14.6 -10.8 -10.1 -8.7 -9.4 -5.4 -5.2
jan02 -14.4 -14.7 -11.4 -10.5 -9.9 -10.4 -5.4 -5.7
jan03 -15.0 -15.5 -12.5 -11.4 -9.8 -9.9 -5.0 -5.3
jan04 -14.3 -14.3 -11.2 -10.2 -9.0 -9.2 -5.9 -5.8
jan05 -16.2 -15.8 -12.0 -11.3 -9.8 -10.1 -5.7 -6.5
jan06 -16.3 -16.9 -13.2 -12.6 -10.7 -11.0 -5.6 -5.6
Thunder Bay Winnipeg The Pas Churchill Regina Pr. Albert
jan01 -14.0 -17.9 -20.5 -24.9 -15.7 -19.3
jan02 -14.0 -16.9 -19.6 -25.0 -15.0 -18.0
jan03 -13.5 -17.7 -20.6 -26.5 -15.8 -18.5
jan04 -13.9 -18.5 -21.0 -26.7 -17.3 -19.4
jan05 -14.4 -18.2 -21.7 -26.7 -17.0 -20.3
jan06 -14.5 -18.4 -22.6 -28.5 -17.1 -20.3
Uranium City Edmonton Calgary Kamloops Vancouver Victoria Pr. George
jan01 -23.5 -12.6 -8.6 -5.3 2.3 3.0 -10.5
jan02 -23.6 -12.7 -8.4 -5.6 2.1 2.7 -11.4
jan03 -24.6 -14.2 -9.2 -6.5 1.9 2.7 -11.0
jan04 -25.5 -14.8 -10.8 -6.8 2.0 2.9 -11.9
jan05 -26.5 -15.4 -12.0 -7.3 1.6 2.6 -13.5
jan06 -27.9 -15.1 -11.7 -7.9 1.4 2.1 -13.2
Pr. Rupert Whitehorse Dawson Yellowknife Iqaluit Inuvik Resolute
jan01 0.4 -17.6 -28.0 -24.5 -23.3 -26.3 -30.7
jan02 0.5 -18.6 -28.9 -25.3 -24.0 -27.1 -30.6
jan03 -0.2 -18.6 -29.5 -26.1 -24.4 -27.8 -31.4
jan04 -0.6 -20.0 -29.0 -27.7 -24.7 -28.2 -31.9
jan05 -1.0 -20.9 -28.8 -27.8 -25.3 -27.2 -31.5
jan06 -0.6 -21.0 -29.9 -28.1 -26.3 -27.6 -31.1
dim(CanadianWeather_Temp)[1] 365 35
day.5에는 365일이 0.5부터 364.5까지의 값으로 들어가 있는데 day.5는 365일에 대한 그리드의 역할을 한다.
CanadianWeather 데이터의 dailyAV(일평균) 중 Temperature.C(섭씨온도)를 분석에 사용한다. 캐나다에 있는 35개 도시에 대한 데이터이다.
# Clustering the well-known "Canadian temperature" data (Ramsay & Silverman)
basis <- create.bspline.basis(c(0, 365), nbasis=21, norder=4) # norder=4 : cubic spline
fdobj <- smooth.basis(day.5, CanadianWeather_Temp,
basis, fdnames=list("Day", "Station", "Deg C"))$fd
res <- funFEM(fdobj, K=4)분석은 basis 생성 -> smoothing -> 클러스터링의 순서로 이루어진다.
create.bspline.basis()함수와smooth.basis()함수는fda팩키지에 포함되어 있다.
create.bspline.basis() 함수에 대한 설명은 아래와 같다.
Functional data objects are constructed by specifying a set of basis functions and a set of coefficients defining a linear combination of these basis functions. The B-spline basis is used for non-periodic functions. B-spline basis functions are polynomial segments jointed end-to-end at at argument values called knots, breaks or join points. The segments have specifiable smoothness across these breaks. B-splne basis functions have the advantages of very fast computation and great flexibility.
B-spline이 무엇인지 위키에 검색해보면
In the mathematical subfield of numerical analysis, a B-spline, or basis spline, is a spline function that has minimal support with respect to a given degree, smoothness, and domain partition. Any spline function of given degree can be expressed as a linear combination of B-splines of that degree. Cardinal B-splines have knots that are equidistant from each other. B-splines can be used for curve-fitting and numerical differentiation of experimental data.
B는 Basis의 약자이고, 스플라인은
스플라인 곡선은 주어진 복수의 제어점을 통과하는 부드러운 곡선으로, 인접한 두 점 사이에의 구간마다 별도의 다항식을 이용해 곡선을 정의한다.
고 하는데 그림을 보면 대강 무슨 말인지 이해할 수 있다.
funFEM() 함수를 통해 리턴되는 값은 다음과 같다.
| name | description |
|---|---|
| model | the model name. |
| K | the number of groups. |
| cls | the group membership of each individual estimated by the Fisher-EM algorithm. |
| P | the posterior probabilities of each individual for each group. |
| prms | the model parameters. |
| U | the orientation of the functional subspace according to the basis functions. |
| aic | the value of the Akaike information criterion. |
| bic | the value of the Bayesian information criterion. |
| icl | the value of the integrated completed likelihood criterion. |
| loglik | the log-likelihood values computed at each iteration of the FEM algorithm. |
| ll | the log-likelihood value obtained at the last iteration of the FEM algorithm. |
| nbprm | the number of free parameters in the model. |
| call | the call of the function. |
| plot | some information to pass to the plot.fem function. |
| crit | the model selction criterion used. |
# Visualization of the partition and the group means
par(mfrow=c(1,2))
plot(fdobj, col=res$cls, lwd=2, lty=1)[1] "done"
fdmeans <- fdobj
fdmeans$coefs <- t(res$prms$my)
plot(fdmeans, col=1:max(res$cls), lwd=2)[1] "done"
str(fdmeans)List of 3
$ coefs : num [1:21, 1:4] -6.17 -10.48 -6.83 -8.69 -4.34 ...
$ basis :List of 10
..$ call : language basisfd(type = type, rangeval = rangeval, nbasis = nbasis, params = params, dropind = dropind, quadvals = qu| __truncated__
..$ type : chr "bspline"
..$ rangeval : num [1:2] 0 365
..$ nbasis : num 21
..$ params : num [1:17] 20.3 40.6 60.8 81.1 101.4 ...
..$ dropind : NULL
..$ quadvals : NULL
..$ values : list()
..$ basisvalues: list()
..$ names : chr [1:21] "bspl4.1" "bspl4.2" "bspl4.3" "bspl4.4" ...
..- attr(*, "class")= chr "basisfd"
$ fdnames:List of 3
..$ : chr "Day"
..$ : chr "Station"
..$ : chr "Deg C"
- attr(*, "class")= chr "fd"
fdmeans에 coefs로 들어가는 res$prms$my 는 nbasis(basis 수) x K(클러스터 수)의 차원으로 되어 있는데 클러스터 마다 basis에 해당하는 계수 값들이 있어야 그래프를 그릴 수 있는 것 같다.
클러스터링 결과는
fda팩키지의plot.fd()함수로 시각화 할 수 있다. 결과를 보면 35개의 곡선이 4개의 클러스터로 나누어지는 것을 볼 수 있다.
2.5 Summary
B-spline을 이용한 functional data clustering
Reference를 보면 multivariate case도 가능한 것 같은데 Help에 이에 관한 설명이 없음
3 funHDDC
3.1 Description
It provides the funHDDC algorithm which allows to cluster functional univariate and multivariate data by modeling each cluster within a specific functional subspace. See (Schmutz et al, 2018) for details on the parameters described below.
3.2 Usage
funHDDC(data, K=1:10, model="AkjBkQkDk", threshold=0.2, itermax=200, eps=1e-6, init="kmeans",
criterion="bic", algo='EM', d_select="Cattell", init.vector=NULL, show=TRUE,
mini.nb=c(5, 10), min.individuals=2, mc.cores=1, nb.rep=1, keepAllRes=TRUE,
kmeans.control = list(), d_max=100)3.3 Arguments
| name | description |
|---|---|
| data | In the univariate case: a functional data object produced by the fda package, in the multivariate case: a list of functional data objects. |
| K | The number of clusters, you can test one partition or multiple partitions at the same time, for example K=2:10. |
| model | The chosen model among ‘AkjBkQkDk’, ‘AkjBQkDk’, ‘AkBkQkDk’, ‘ABkQkDk’, ‘AkBQkDk’, ‘ABQkDk’. ‘AkjBkQkDk’ is the default. You can test multiple models at the same time with the command c(). For example c(“AkjBkQkDk”,“AkjBQkDk”). |
| threshold | The threshold of the Cattell’ scree-test used for selecting the group-specific intrinsic dimensions. |
| itermax | The maximum number of iterations. |
| eps | The threshold of the convergence criterion. |
| init | A character string. It is the way to initialize the E-M algorithm. There are five ways of initialization: “kmeans” (default), “param”, “random”, “mini-em” or “vector”. See details for more information. It can also be directly initialized with a vector containing the prior classes of the observations. |
| criterion | The criterion used for model selection: bic (default) or icl. But we recommand using slopeHeuristic function provided in this package. |
| algo | A character string indicating the algorithm to be used. The available algorithms are the Expectation-Maximisation (“EM”), the Classification E-M (“CEM”) and the Stochastic E-M (“SEM”). The default algorithm is the “EM”. |
| d_select | Either “Cattell” (default) or “BIC”. This parameter selects which method to use to select the intrinsic dimensions of subgroups. |
| init.vector | A vector of integers or factors. It is a user-given initialization. It should be of the same length as of the data. Only used when init=“vector”. |
| show | Use show = FALSE to settle off the informations that may be printed. |
| mini.nb | A vector of integers of length two. This parameter is used in the “mini-em” initialization. The first integer sets how many times the algorithm is repeated; the second sets the maximum number of iterations the algorithm will do each time. For example, if init=“mini-em” and mini.nb=c(5,10), the algorithm wil be launched 5 times, doing each time 10 iterations; finally the algorithm will begin with the initialization that maximizes the log-likelihood. |
| min.individuals | This parameter is used to control for the minimum population of a class. If the population of a class becomes stricly inferior to ‘min.individuals’ then the algorithm stops and gives the message: ‘pop<min.indiv.’. Here the meaning of “population of a class” is the sum of its posterior probabilities. The value of ‘min.individuals’ cannot be lower than 2. |
| mc.cores | Positive integer, default is 1. If mc.cores>1, then parallel computing is used, using mc.cores cores. Warning for Windows users only: the parallel computing can sometimes be slower than using one single core (due to how parLapply works). |
| nb.rep | A positive integer (default is 1 for kmeans initialization and 20 for random initialization). Each estimation (i.e. combination of (model, K, threshold)) is repeated nb.rep times and only the estimation with the highest log-likelihood is kept. |
| keepAllRes | Logical. Should the results of all runs be kept? If so, an argument all_results is created in the results. Default is TRUE. |
| kmeans.control | A list. The elements of this list should match the parameters of the kmeans initialization (see kmeans help for details). The parameters are “iter.max”, “nstart” and “algorithm”. |
| d_max | A positive integer. The maximum number of dimensions to be computed. Default is 100. It means that the instrinsic dimension of any cluster cannot be larger than d_max. It quickens a lot the algorithm for datasets with a large number of variables (e.g. thousands). |
3.4 Example
3.4.1 Univariate example
library(funHDDC)
data("trigo")
dim(trigo)[1] 100 101
basis <- create.bspline.basis(c(0,1), nbasis=25)
var1 <- smooth.basis(argvals=seq(0,1,length.out = 100),y=t(trigo[,1:100]),fdParobj=basis)$fd
res.uni <- funHDDC(var1,K=2,model="AkBkQkDk",init="kmeans",threshold=0.2) model K threshold complexity BIC
1 AKBKQKDK 2 0.2 103 -454,549.94
SELECTED: model AKBKQKDK with 2 clusters.
Selection Criterion: BIC.
plot(var1,col=res.uni$class)[1] "done"
3.4.2 Multivariate example
data("triangle")
basis <- create.bspline.basis(c(1,21), nbasis=25)
var1 <- smooth.basis(argvals=seq(1,21, length.out = 101), y=t(triangle[,1:101]), fdParobj=basis)$fd
var2 <- smooth.basis(argvals=seq(1,21, length.out = 101), y=t(triangle[,102:202]), fdParobj=basis)$fd
res.multi <- funHDDC(list(var1,var2), K=3,model="AkjBkQkDk",init="kmeans",threshold=0.2) model K threshold complexity BIC
1 AKJBKQKDK 3 0.2 736 -25,384.82
SELECTED: model AKJBKQKDK with 3 clusters.
Selection Criterion: BIC.
par(mfrow=c(1,2))
plot(var1, col=res.multi$class)[1] "done"
plot(var2, col=res.multi$class)[1] "done"
3.4.3 Clustering the “Canadian temperature” data (Ramsey & Silverman): univariate case
funFEM() 함수의 예제와 비교하기 위해 create.bspline.basis() 함수를 사용하려고 하였는데 동일한 세팅으로 코드를 돌리니 모델이 수렴하지 않았다.
basis_temp <- create.bspline.basis(c(0, 365), nbasis=21, norder=4) # norder=4 : cubic spline
daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis_temp,
fdnames=list("Day", "Station", "Deg C"))$fd
res.uni <- funHDDC(daytempfd, K=4)Warning in funHDDC(daytempfd, K = 4): All models diverged.
Help(funHDDC)의 Examples와 같이 create.fourier.basis() 함수로 basis를 만들어서 사용하였다. 클러스터 개수는 funFEM() 함수의 Examples와 같이 4로 맞추어 비교하였다.
basis_temp <- create.fourier.basis(c(0, 365), nbasis=21, period=365)
daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis_temp,
fdnames=list("Day", "Station", "Deg C"))$fd
res.uni <- funHDDC(daytempfd, K=4) model K threshold complexity BIC
1 AKJBKQKDK 4 0.2 274 -4,648.89
SELECTED: model AKJBKQKDK with 4 clusters.
Selection Criterion: BIC.
## Graphical representation of groups mean curves
par(mfrow=c(1,2))
plot(daytempfd, col=res.uni$class, lwd=2, lty=1)[1] "done"
select1 <- fd(daytempfd$coefs[,which(res.uni$class==1)], daytempfd$basis)
select2 <- fd(daytempfd$coefs[,which(res.uni$class==2)], daytempfd$basis)
select3 <- fd(daytempfd$coefs[,which(res.uni$class==3)], daytempfd$basis)
select4 <- fd(daytempfd$coefs[,which(res.uni$class==4)], daytempfd$basis)
plot(mean.fd(select1),col=1, ylim=c(-20,20), lty=1, lwd=3)[1] "done"
lines(mean.fd(select2),col=2, lty=1,lwd=3)
lines(mean.fd(select3),col=3, lty=1,lwd=3)
lines(mean.fd(select4),col=4, lty=1,lwd=3)3.4.4 Clustering the “Canadian temperature” data (Ramsey & Silverman): multivariate case
basis <- create.fourier.basis(c(0, 365), nbasis=21, period=365)
daytempfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Temperature.C"], basis,
fdnames=list("Day", "Station", "Deg C"))$fd
dayprecfd <- smooth.basis(day.5, CanadianWeather$dailyAv[,,"Precipitation.mm"], basis,
fdnames=list("Day", "Station", "Mm"))$fd
res.multi <- funHDDC(list(daytempfd,dayprecfd), K=4) model K threshold complexity BIC
1 AKJBKQKDK 4 0.2 505 -8,189.95
SELECTED: model AKJBKQKDK with 4 clusters.
Selection Criterion: BIC.
## Graphical representation of groups mean curves
par(mfrow=c(2,2))
## Temperature
plot(daytempfd, col=res.multi$class, lwd=2, lty=1)[1] "done"
select1 <- fd(daytempfd$coefs[,which(res.multi$class==1)], daytempfd$basis)
select2 <- fd(daytempfd$coefs[,which(res.multi$class==2)], daytempfd$basis)
select3 <- fd(daytempfd$coefs[,which(res.multi$class==3)], daytempfd$basis)
select4 <- fd(daytempfd$coefs[,which(res.multi$class==4)], daytempfd$basis)
plot(mean.fd(select1),col=1, ylim=c(-20,20), lty=1, lwd=3)[1] "done"
lines(mean.fd(select2),col=2, lty=1,lwd=3)
lines(mean.fd(select3),col=3, lty=1,lwd=3)
lines(mean.fd(select4),col=4, lty=1,lwd=3)
## Precipitation
plot(dayprecfd, col=res.multi$class, lwd=2, lty=1)[1] "done"
select1b <- fd(dayprecfd$coefs[,which(res.multi$class==1)], dayprecfd$basis)
select2b <- fd(dayprecfd$coefs[,which(res.multi$class==2)], dayprecfd$basis)
select3b <- fd(dayprecfd$coefs[,which(res.multi$class==3)], dayprecfd$basis)
select4b <- fd(dayprecfd$coefs[,which(res.multi$class==4)], dayprecfd$basis)
plot(mean.fd(select1b),col=1, ylim=c(-20,20), lty=1, lwd=3)[1] "done"
lines(mean.fd(select2b),col=2, lty=1,lwd=3)
lines(mean.fd(select3b),col=3, lty=1,lwd=3)
lines(mean.fd(select4b),col=4, lty=1,lwd=3)3.5 Summary
- multivariate case에 대한 example도 제공됨
4 fdakma
4.1 Description
kma jointly performs clustering and alignment of a functional dataset (multidimensional or unidimensional functions). To run kma function with different numbers of clusters and/or different alignment methods see kma.compare.
4.2 Usage
kma(x, y0 = NULL, y1 = NULL, n.clust = 1, warping.method = "affine",
similarity.method = "d1.pearson", center.method = "k-means", seeds = NULL,
optim.method = "L-BFGS-B", span = 0.15, t.max = 0.1, m.max = 0.1, n.out = NULL,
tol = 0.01, fence = TRUE, iter.max = 100, show.iter = 0, nstart=2, return.all=FALSE,
check.total.similarity=FALSE)4.3 Arguments
| name | description |
|---|---|
| x | matrix n.func X grid.size or vector grid.size: the abscissa values where each function is evaluated. n.func: number of functions in the dataset. grid.size: maximal number of abscissa values where each function is evaluated. The abscissa points may be unevenly spaced and they may differ from function to function. x can also be a vector of length grid.size. In this case, x will be used as abscissa grid for all functions. |
| y0 | matrix n.func X grid.size or array n.func X grid.size X d: evaluations of the set of original functions on the abscissa grid x. n.func: number of functions in the dataset. grid.size: maximal number of abscissa values where each function is evaluated. d: (only if the sample is multidimensional) number of function components, i.e. each function is a d-dimensional curve. Default value of y0 is NULL. The parameter y0 must be provided if the chosen similarity.method concerns original functions. |
| y1 | matrix n.func X grid.size or array n.func X grid.size X d: evaluations of the set of original functions first derivatives on the abscissa grid x. Default value of y1 is NULL. The parameter y1 must be provided if the chosen similarity.method concerns original function first derivatives. |
| n.clust | scalar: required number of clusters. Default value is 1. Note that if n.clust=1 kma performs only alignment without clustering. |
| warping.method | character: type of alignment required. If warping.method=‘NOalignment’ kma performs only k-mean clustering (without alignment). If warping.method=‘affine’ kma performs alignment (and possibly clustering) of functions using linear affine transformation as warping functions, i.e., x.final = dilationx + shift. If warping.method=‘shift’ kma allows only shift, i.e., x.final = x + shift. If warping.method=‘dilation’ kma allows only dilation, i.e., x.final = dilationx. Default value is ‘affine’. |
| similarity.method | character: required similarity measure. Possible choices are: ‘d0.pearson’, ‘d1.pearson’, ‘d0.L2’, ‘d1.L2’, ‘d0.L2.centered’, ‘d1.L2.centered’. Default value is ‘d1.pearson’. See kma.similarity for details. |
| center.method | character: type of clustering method to be used. Possible choices are: ‘k-means’ and ‘k-medoids’. Default value is ‘k-means’. |
| seeds | vector max(n.clust) or matrix nstart X n.clust: indexes of the functions to be used as initial centers. If it is a matrix, each row contains the indexes of the initial centers of one of the nstart initializations. In the case where not all the values of seeds are provided, those not provided are randomly chosen among the n.func original functions. If seeds=NULL all the centers are randomly chosen. Default value of seeds is NULL |
| optim.method | character: optimization method chosen to find the best warping functions at each iteration. Possible choices are: ‘L-BFGS-B’ and ‘SANN’. See optim function for details. Default method is ‘L-BFGS-B’. |
| span | scalar: the span to be used for the loess procedure in the center estimation step when center.method=‘k-means’. Default value is 0.15. If center.method=‘k-medoids’ value of span is ignored. |
| t.max | scalar: t.max controls the maximal allowed shift, at each iteration, in the alignment procedure with respect to the range of curve domains. t.max must be such that 0<t.max<1 (e.g., t.max=0.1 means that shift is bounded, at each iteration, between -0.1*range(x) and +0.1*range(x)). Default value is 0.1. If warping.method=‘dilation’ value of t.max is ignored. |
| m.max | scalar: m.max controls the maximal allowed dilation, at each iteration, in the alignment procedure. m.max must be such that 0<m.max<1 (e.g., m.max=0.1 means that dilation is bounded, at each iteration, between 1-0.1 and 1+0.1 ). Default value is 0.1. If warping.method=‘shift’ value of m.max is ignored. |
| n.out | scalar: the desired length of the abscissa for computation of the similarity indexes and the centers. Default value is round(1.1*grid.size). |
| tol | scalar: the algorithm stops when the increment of similarity of each function with respect to the corrispondent center is lower than tol. Default value is 0.01. |
| fence | boolean: if fence=TRUE a control is activated at the end of each iteration. The aim of the control is to avoid shift/dilation outlighers with respect to their computed distributions. If fence=TRUE the running time can increase considerably. Default value of fence is TRUE. |
| iter.max | scalar: maximum number of iterations in the k-mean alignment cycle. Default value is 100. |
| show.iter | boolean: if show.iter=TRUE kma shows the current iteration of the algorithm. Default value is FALSE. |
| nstart | scalar: number of initializations with different seeds. Default value is 2. This parameter is used only if center.method is ‘k-medoids’. When center.method = ‘k-means’ one initialization is performed. |
| return.all | boolean: if return.all=TRUE the results of all the nstart initializations are returned; the output is a list of length nstart. If return.all=FALSE only the best result is provided (the one with higher mean similarity if similarity.method is ‘d0.pearson’ or’d1.pearson’, or the one with lower distance if similarity.method is ‘d0.L2’, ‘d1.L2’, ‘d0.L2.centered’ or ‘d1.L2.centered’). Default value is FALSE. |
| check.total.similarity | boolean: if check.total.similarity=TRUE at each iteration the algorithm checks if there is a decrease of the total similarity and stops. In the affermative case the result obtained in the penultimate iteration is returned. Default value is FALSE |
4.4 Example
library(fdakma)
data(kma.data)
x <- kma.data$x # abscissas
y0 <- kma.data$y0 # evaluations of original functions
y1 <- kma.data$y1 # evaluations of original function first derivatives
# Plot of original functions
matplot(t(x),t(y0), type='l', xlab='x', ylab='orig.func')
title ('Original functions')# Plot of original function first derivatives
matplot(t(x),t(y1), type='l', xlab='x', ylab='orig.deriv')
title ('Original function first derivatives')# Example: result of kma function with 2 clusters, allowing affine transformation for the abscissas and considering 'd1.pearson' as similarity.method.
fdakma_example <- kma (
x=x, y0=y0, y1=y1, n.clust = 2,
warping.method = 'affine',
similarity.method = 'd1.pearson',
center.method = 'k-means',
seeds = c(1,21)
)
kma.show.results(fdakma_example)names(fdakma_example) [1] "iterations" "x" "y0"
[4] "y1" "n.clust" "warping.method"
[7] "similarity.method" "center.method" "x.center.orig"
[10] "y0.center.orig" "y1.center.orig" "similarity.orig"
[13] "x.final" "n.clust.final" "x.centers.final"
[16] "y0.centers.final" "y1.centers.final" "labels"
[19] "similarity.final" "dilation.list" "shift.list"
[22] "dilation" "shift"
# Labels assigned to each function
fdakma_example$labels [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
# Total shifts and dilations applied to the original abscissa to obtain the aligned abscissa
fdakma_example$shift [1] 0.338952447 0.319586287 0.368394585 0.227746184 0.260489905
[6] 0.302015457 0.316415039 0.380486215 0.308146261 0.255387772
[11] -0.303649497 -0.197017682 -0.259354606 -0.458456588 -0.284497204
[16] -0.437181167 -0.313016144 -0.406582174 -0.117449217 -0.300415873
[21] 0.067078440 -0.064369102 0.035742555 -0.033695095 0.042297283
[26] 0.053293202 -0.003068418 -0.004865682 -0.036760546 -0.055652637
fdakma_example$dilation [1] 1.0761533 1.1325796 1.0793233 1.1129867 1.0696233 1.1822717 1.2043665
[8] 1.0838001 1.0295260 1.0353835 0.9409293 0.8306430 0.9158617 0.8306430
[15] 0.9096607 0.9899872 0.9293422 0.8814535 0.8811105 0.8843551 0.9040758
[22] 1.0357245 0.9307015 0.9214329 1.0610187 1.0327258 1.0334928 1.0353821
[29] 0.9792999 1.0661460
fdakma_example$x.center.orig [1] 0.00000000 0.02869034 0.05738069 0.08607103 0.11476138 0.14345172
[7] 0.17214206 0.20083241 0.22952275 0.25821309 0.28690344 0.31559378
[13] 0.34428413 0.37297447 0.40166481 0.43035516 0.45904550 0.48773585
[19] 0.51642619 0.54511653 0.57380688 0.60249722 0.63118757 0.65987791
[25] 0.68856825 0.71725860 0.74594894 0.77463928 0.80332963 0.83201997
[31] 0.86071032 0.88940066 0.91809100 0.94678135 0.97547169 1.00416204
[37] 1.03285238 1.06154272 1.09023307 1.11892341 1.14761375 1.17630410
[43] 1.20499444 1.23368479 1.26237513 1.29106547 1.31975582 1.34844616
[49] 1.37713651 1.40582685 1.43451719 1.46320754 1.49189788 1.52058823
[55] 1.54927857 1.57796891 1.60665926 1.63534960 1.66403994 1.69273029
[61] 1.72142063 1.75011098 1.77880132 1.80749166 1.83618201 1.86487235
[67] 1.89356270 1.92225304 1.95094338 1.97963373 2.00832407 2.03701441
[73] 2.06570476 2.09439510 2.12308545 2.15177579 2.18046613 2.20915648
[79] 2.23784682 2.26653717 2.29522751 2.32391785 2.35260820 2.38129854
[85] 2.40998888 2.43867923 2.46736957 2.49605992 2.52475026 2.55344060
[91] 2.58213095 2.61082129 2.63951164 2.66820198 2.69689232 2.72558267
[97] 2.75427301 2.78296336 2.81165370 2.84034404 2.86903439 2.89772473
[103] 2.92641507 2.95510542 2.98379576 3.01248611 3.04117645 3.06986679
[109] 3.09855714 3.12724748 3.15593783 3.18462817 3.21331851 3.24200886
[115] 3.27069920 3.29938954 3.32807989 3.35677023 3.38546058 3.41415092
[121] 3.44284126 3.47153161 3.50022195 3.52891230 3.55760264 3.58629298
[127] 3.61498333 3.64367367 3.67236402 3.70105436 3.72974470 3.75843505
[133] 3.78712539 3.81581573 3.84450608 3.87319642 3.90188677 3.93057711
[139] 3.95926745 3.98795780 4.01664814 4.04533849 4.07402883 4.10271917
[145] 4.13140952 4.16009986 4.18879020 4.21748055 4.24617089 4.27486124
[151] 4.30355158 4.33224192 4.36093227 4.38962261 4.41831296 4.44700330
[157] 4.47569364 4.50438399 4.53307433 4.56176468 4.59045502 4.61914536
[163] 4.64783571 4.67652605 4.70521639 4.73390674 4.76259708 4.79128743
[169] 4.81997777 4.84866811 4.87735846 4.90604880 4.93473915 4.96342949
[175] 4.99211983 5.02081018 5.04950052 5.07819086 5.10688121 5.13557155
[181] 5.16426190 5.19295224 5.22164258 5.25033293 5.27902327 5.30771362
[187] 5.33640396 5.36509430 5.39378465 5.42247499 5.45116533 5.47985568
[193] 5.50854602 5.53723637 5.56592671 5.59461705 5.62330740 5.65199774
[199] 5.68068809 5.70937843 5.73806877 5.76675912 5.79544946 5.82413981
[205] 5.85283015 5.88152049 5.91021084 5.93890118 5.96759152 5.99628187
[211] 6.02497221 6.05366256 6.08235290 6.11104324 6.13973359 6.16842393
[217] 6.19711428 6.22580462 6.25449496 6.28318531
fdakma_example$x.centers.final [1] -0.458456588 -0.420364671 -0.382272754 -0.344180837 -0.306088920
[6] -0.267997004 -0.229905087 -0.191813170 -0.153721253 -0.115629336
[11] -0.077537419 -0.039445503 -0.001353586 0.036738331 0.074830248
[16] 0.112922165 0.151014082 0.189105999 0.227197915 0.265289832
[21] 0.303381749 0.341473666 0.379565583 0.417657500 0.455749416
[26] 0.493841333 0.531933250 0.570025167 0.608117084 0.646209001
[31] 0.684300917 0.722392834 0.760484751 0.798576668 0.836668585
[36] 0.874760502 0.912852418 0.950944335 0.989036252 1.027128169
[41] 1.065220086 1.103312003 1.141403919 1.179495836 1.217587753
[46] 1.255679670 1.293771587 1.331863504 1.369955420 1.408047337
[51] 1.446139254 1.484231171 1.522323088 1.560415005 1.598506922
[56] 1.636598838 1.674690755 1.712782672 1.750874589 1.788966506
[61] 1.827058423 1.865150339 1.903242256 1.941334173 1.979426090
[66] 2.017518007 2.055609924 2.093701840 2.131793757 2.169885674
[71] 2.207977591 2.246069508 2.284161425 2.322253341 2.360345258
[76] 2.398437175 2.436529092 2.474621009 2.512712926 2.550804842
[81] 2.588896759 2.626988676 2.665080593 2.703172510 2.741264427
[86] 2.779356343 2.817448260 2.855540177 2.893632094 2.931724011
[91] 2.969815928 3.007907845 3.045999761 3.084091678 3.122183595
[96] 3.160275512 3.198367429 3.236459346 3.274551262 3.312643179
[101] 3.350735096 3.388827013 3.426918930 3.465010847 3.503102763
[106] 3.541194680 3.579286597 3.617378514 3.655470431 3.693562348
[111] 3.731654264 3.769746181 3.807838098 3.845930015 3.884021932
[116] 3.922113849 3.960205765 3.998297682 4.036389599 4.074481516
[121] 4.112573433 4.150665350 4.188757266 4.226849183 4.264941100
[126] 4.303033017 4.341124934 4.379216851 4.417308768 4.455400684
[131] 4.493492601 4.531584518 4.569676435 4.607768352 4.645860269
[136] 4.683952185 4.722044102 4.760136019 4.798227936 4.836319853
[141] 4.874411770 4.912503686 4.950595603 4.988687520 5.026779437
[146] 5.064871354 5.102963271 5.141055187 5.179147104 5.217239021
[151] 5.255330938 5.293422855 5.331514772 5.369606688 5.407698605
[156] 5.445790522 5.483882439 5.521974356 5.560066273 5.598158189
[161] 5.636250106 5.674342023 5.712433940 5.750525857 5.788617774
[166] 5.826709691 5.864801607 5.902893524 5.940985441 5.979077358
[171] 6.017169275 6.055261192 6.093353108 6.131445025 6.169536942
[176] 6.207628859 6.245720776 6.283812693 6.321904609 6.359996526
[181] 6.398088443 6.436180360 6.474272277 6.512364194 6.550456110
[186] 6.588548027 6.626639944 6.664731861 6.702823778 6.740915695
[191] 6.779007611 6.817099528 6.855191445 6.893283362 6.931375279
[196] 6.969467196 7.007559112 7.045651029 7.083742946 7.121834863
[201] 7.159926780 7.198018697 7.236110614 7.274202530 7.312294447
[206] 7.350386364 7.388478281 7.426570198 7.464662115 7.502754031
[211] 7.540845948 7.578937865 7.617029782 7.655121699 7.693213616
[216] 7.731305532 7.769397449 7.807489366 7.845581283 7.883673200
4.4.1 Using funHDDC() function
dim(y0)[1] 30 200
basis <- create.bspline.basis(c(0,8), nbasis=20)
var1 <- smooth.basis(argvals=seq(0,8, length.out = 200), y=t(y0), fdParobj=basis)$fd
res.uni <- funHDDC(var1, K=2, model="AkBkQkDk", init="kmeans", threshold=0.2) model K threshold complexity BIC
1 AKBKQKDK 2 0.2 101 -12,536.10
SELECTED: model AKBKQKDK with 2 clusters.
Selection Criterion: BIC.
plot(var1, col=res.uni$class)[1] "done"
cbind(fdakma=fdakma_example$labels, funHDDC=res.uni$class) fdakma funHDDC
[1,] 1 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
[5,] 1 1
[6,] 1 1
[7,] 1 1
[8,] 1 1
[9,] 1 1
[10,] 1 1
[11,] 1 2
[12,] 1 1
[13,] 1 2
[14,] 1 1
[15,] 1 2
[16,] 1 1
[17,] 1 2
[18,] 1 1
[19,] 1 2
[20,] 1 1
[21,] 2 1
[22,] 2 1
[23,] 2 1
[24,] 2 1
[25,] 2 1
[26,] 2 1
[27,] 2 1
[28,] 2 1
[29,] 2 1
[30,] 2 1
- 두 방법의 결과가 매우 다르다.
4.5 Summary
x gird, original data, first derivative가 필요
다변량인 경우도 지원한다고 되어 있는데 x가 변수마다 다른 경우는 지원하지 않음