Consider a sample \(E=\{w_1,w_2,\ldots,w_n\}\) of \(n\) individuals.
\(x^1,\ldots,x^d\) are the \(d\) observed variables, quantitative and/or qualitative.
\(x_i^j=x^j(w_i)\) denotes the observation of \(w_i\) for the variable \(x^j\).
Question: are there naturally constituted homogeneous groups of individuals in \(E\), and how many?
We assume here that the data consist exclusively of observations of quantitative variables.
The Euclidean distance (the \(L_2\) norm): if \(x_i=x(w_i)=(x_i^1,\ldots,x_i^d)'\) and \(x_{i'}=x(w_{i'})=(x_{i'}^1,\ldots,x_{i'}^d)'\), then
\[ d(w_i,w_{i'})=\sqrt{\sum_{j=1}^d(x_i^j-x_{i'}^j)^2}.\]
Properties: the clusters obtained with the Euclidean distance are hyperspherical and invariant under rotations and translations; other linear transformations can distort the clusters, and variables with a large variance or large values can dominate the remaining variables.
Hence the need to normalize the data.
The Minkowski distance: \[ d_p(w_i,w_{i'})=\left(\sum_{j=1}^d\left|x_i^j-x_{i'}^j\right|^p\right)^{1/p}.\] For \(p=2\) we recover the Euclidean distance.
The \(L_\infty\) distance:
\[d_\infty(w_i,w_{i'})=\displaystyle\max_{1\leq j\leq d} |x_{i}^j-x_{i'}^j|.\]
The Mahalanobis distance: \[d(w_i,w_{i'})=(x_i-x_{i'})'S^{-1}(x_i-x_{i'})\] where \(S\) is the covariance matrix of the data. The clusters have a hyperellipsoidal shape and are invariant under linear transformations. If \(S=I\), it is equivalent to the (squared) Euclidean distance.
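As a quick illustration (a small sketch added here, not part of the original transcript; x1, x2 and S are toy objects), these distances can be computed directly in R:

x1 <- c(1, 2, 3); x2 <- c(2, 0, 3)
X12 <- rbind(x1, x2)
dist(X12, method = "euclidean")          # L2 distance
dist(X12, method = "minkowski", p = 1)   # Minkowski distance, here with p = 1
dist(X12, method = "maximum")            # L-infinity distance
S <- diag(c(1, 4, 9))                    # toy covariance matrix
mahalanobis(x1, center = x2, cov = S)    # (x1 - x2)' S^{-1} (x1 - x2)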
> nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
> head(nba)
Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA
1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5
2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7
3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1
4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1
5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7
6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1
X3PP ORB DRB TRB AST STL BLK TO PF
1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8
> d1=dist(nba[,-1],method = "euclidean") ## Compute the distances without normalizing the data
>
> d1 <- data.matrix(d1)
> p <- ncol(d1)
>
> library(fields)
Loading required package: spam
Loading required package: grid
Spam version 1.3-0 (2015-10-24) is loaded.
Type 'help( Spam)' or 'demo( spam)' for a short introduction
and overview of this package.
Help for individual functions is also obtained by adding the
suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
Attaching package: 'spam'
The following objects are masked from 'package:base':
backsolve, forwardsolve
Loading required package: maps
# maps v3.1: updated 'world': all lakes moved to separate new #
# 'lakes' database. Type '?world' or 'news(package="maps")'. #
> cols=topo.colors(64)
> image(1:p, 1:p, d1, axes = FALSE, xlab="", ylab="",col=cols)
> axis(1, 1:p, nba$Name, cex.axis = 0.5, las=3)
> axis(2, 1:p, nba$Name, cex.axis = 0.5, las=1)
>
>
> ## Now we normalize the data.
> d2=dist(scale(nba[,-1]),method = "euclidean") ## Compute the distances on the normalized data
>
> d2 <- data.matrix(d2)
> p <- ncol(d2)
>
> image(1:p, 1:p, d2, axes = FALSE, xlab="", ylab="",col=cols)
> axis(1, 1:p, nba$Name, cex.axis = 0.5, las=3)
> axis(2, 1:p, nba$Name, cex.axis = 0.5, las=1)
Agglomerative hierarchical algorithms update the distance between a cluster \(C^l\) and the cluster \(C^{ij}\) obtained by merging \(C^i\) and \(C^j\) through the Lance-Williams recurrence: \[\begin{array}{rcl} d(C^l,C^{ij})&=&\alpha_i d(C^l,C^i)+ \alpha_j d(C^l,C^j)+\beta d(C^i,C^j)+\\ & & \gamma\left| d(C^l,C^i)-d(C^l,C^j)\right| \end{array}\]
| Clustering algorithm | \(\alpha_i\) | \(\alpha_j\) | \(\beta\) | \(\gamma\) |
|---|---|---|---|---|
| Single Linkage | 1/2 | 1/2 | 0 | -1/2 |
| Complete Linkage | 1/2 | 1/2 | 0 | 1/2 |
| Centroid Linkage | \(\frac{n_i}{n_i+n_j}\) | \(\frac{n_j}{n_i+n_j}\) | \(-\frac{n_in_j}{(n_i+n_j)^2}\) | 0 |
| Ward's method | \(\frac{n_i+n_l}{n_i+n_j+n_l}\) | \(\frac{n_j+n_l}{n_i+n_j+n_l}\) | \(-\frac{n_l}{n_i+n_j+n_l}\) | 0 |
Single Linkage: \(d(C^l,C^{ij})=\min(d(C^l,C^{i}),d(C^l,C^{j}))\)
Complete Linkage: \(d(C^l,C^{ij})=\max(d(C^l,C^{i}),d(C^l,C^{j}))\)
Centroid Linkage: \(d(C^l,C^{ij})=d(m_l,m_{ij})\) where \(m_l\) and \(m_{ij}\) are the centres of gravity of \(C^l\) and \(C^{ij}\) respectively.
Ward's method: it aims to minimize the increase of the within-cluster variance \[E=\sum_{l=1}^k \sum_{x_i\in C^l} d(x_i,m_l)^2.\]
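As a quick check that the coefficients of the table reproduce the explicit formulas above, write \(a=d(C^l,C^i)\) and \(b=d(C^l,C^j)\); for single linkage (\(\alpha_i=\alpha_j=1/2\), \(\beta=0\), \(\gamma=-1/2\)) the recurrence gives
\[d(C^l,C^{ij})=\tfrac{1}{2}a+\tfrac{1}{2}b-\tfrac{1}{2}\left|a-b\right|=\min(a,b),\]
and with \(\gamma=+1/2\) (complete linkage) one obtains \(\tfrac{1}{2}a+\tfrac{1}{2}b+\tfrac{1}{2}|a-b|=\max(a,b)\).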
> X=nba[,-1]
> rownames(X)=nba$Name
> d2=dist(scale(X),method = "euclidean") ## Compute the distances on the normalized data
> hc1=hclust(d2,method = "complete")
> hc1
Call:
hclust(d = d2, method = "complete")
Cluster method : complete
Distance : euclidean
Number of objects: 50
> plot(hc1, hang = -1, labels=nba$Name) ## Plot the dendrogram.
> plot(1:50,hc1$height[50:1],type="b") ## An approximate elbow appears at 5 clusters.
> hc1$order ## Order of the individuals on the dendrogram
[1] 25 46 29 8 33 13 20 36 43 15 11 19 38 50 41 44 27 47 40 24 39 16 22
[24] 31 42 45 28 48 10 35 26 30 37 49 21 23 17 32 12 14 18 34 9 1 2 7
[47] 5 4 3 6
> rownames(X)[hc1$order]
[1] "Dwight Howard " "Shaquille O'neal " "Yao Ming "
[4] "Al Jefferson " "Tim Duncan " "Antawn Jamison "
[7] "Zachary Randolph " "Pau Gasol " "LaMarcus Aldridge "
[10] "Amare Stoudemire " "Chris Bosh " "David West "
[13] "Corey Maggette " "Nate Robinson " "Richard Hamilton "
[16] "Josh Howard " "Al Harrington " "Rashard Lewis "
[19] "John Salmons " "Ben Gordon " "O.J. Mayo "
[22] "Joe Johnson " "Vince Carter " "Jason Terry "
[25] "Ray Allen " "Maurice Williams " "Jamal Crawford "
[28] "Chauncey Billups " "Carmelo Anthony " "Rudy Gay "
[31] "Paul Pierce " "Richard Jefferson " "Andre Iguodala "
[34] "Allen Iverson " "Caron Butler " "Stephen Jackson "
[37] "Devin Harris " "Deron Williams " "Brandon Roy "
[40] "Tony Parker " "Michael Redd " "Monta Ellis "
[43] "Chris Paul " "Dwyane Wade " "LeBron James "
[46] "Kevin Martin " "Danny Granger " "Dirk Nowitzki "
[49] "Kobe Bryant " "Kevin Durant "
> hc1$merge ## Step-by-step record of the agglomeration process.
[,1] [,2]
[1,] -24 -39
[2,] -42 -45
[3,] -40 1
[4,] -26 -30
[5,] -16 -22
[6,] -11 -19
[7,] -27 -47
[8,] -3 -6
[9,] -21 -23
[10,] -31 2
[11,] -41 -44
[12,] -35 4
[13,] -1 -2
[14,] -28 -48
[15,] 3 5
[16,] -50 11
[17,] -12 -14
[18,] -37 -49
[19,] -17 -32
[20,] -4 8
[21,] -13 -20
[22,] -36 -43
[23,] -10 12
[24,] -15 6
[25,] 7 15
[26,] 10 14
[27,] 9 19
[28,] -8 -33
[29,] 18 27
[30,] -5 20
[31,] -18 -34
[32,] 22 24
[33,] 25 26
[34,] -38 16
[35,] 21 32
[36,] 23 29
[37,] 17 31
[38,] -9 13
[39,] -25 -46
[40,] 36 37
[41,] 28 35
[42,] -7 30
[43,] 33 40
[44,] 34 43
[45,] 38 42
[46,] -29 41
[47,] 39 46
[48,] 44 45
[49,] 47 48
> library(ape) ## Use the ape package for nicer dendrogram displays
> dt=as.phylo(hc1)
>
> plot(dt, edge.color = rainbow(length(dt$edge)/2), tip.color ="blue", edge.width = 2, font = 2,cex = 0.9)
> title("Basket players", line =-29, cex.main = 1.8)
> title("NBA data", line =-30, cex.main = 0.8, col ="gray")
> axisPhylo( 1, las = 1)
> plot(dt, edge.color = rainbow(length(dt$edge)/2), tip.color ="blue", edge.width = 2, font = 2, type="cladogram",cex = 0.9)
> axisPhylo( 1, las = 1)
> plot(dt, edge.color = rainbow(length(dt$edge)/2), tip.color ="blue", edge.width = 2, font = 2, type="fan",cex = 0.8)
> axisPhylo( 1, las = 1)
> op = par(bg = "#DAE3FA")
> plot(hc1, col = "#487AA1", col.main = "#45ADA8", col.lab = "#7C8071",
+ col.axis = "#F38630", lwd = 3, lty = 3, sub = "",hang=-1, axes = FALSE)
> # add axis
> aa=seq(round(min(hc1$height)),round(max(hc1$height)),4)
> axis(side = 2, at = aa, col = "#F38630", labels = FALSE,
+ lwd = 2)
> # add text in margin
> mtext(aa, side = 2, at = aa, line = 1,
+ col = "#A38630", las = 2)> par(op)> labelColors = rainbow(5)
> clusMember = cutree(hc1, 5) ## On coupe en 5 classes
> hcd = as.dendrogram(hc1) ## On transforme hc1 en un objet dendogramme
> # On créé une fonction pour colorer les labels des classes
> colLab <- function(n) {
+ if (is.leaf(n)) {
+ a <- attributes(n)
+ labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
+ attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
+ }
+ n
+ }
> # Apply colLab to every node with dendrapply
> clusDendro = dendrapply(hcd, colLab)
> # make plot
> plot(clusDendro, main = "Les classes", type = "triangle")
> require(FactoMineR)
Loading required package: FactoMineR
> require(devtools)
Loading required package: devtools
> require(factoextra)
Loading required package: factoextra
Loading required package: ggplot2
> X=nba[,-1]
> rownames(X)=nba$Name
> res.pca=PCA(X,scale.unit = T,ncp = 5,graph=F)
> fviz_screeplot(res.pca, ncp = 10) + theme_classic()
We therefore keep three principal components and run a classification with Ward's agglomeration method.
> d2=dist(res.pca$ind$coord[,1:3],method = "euclidean") ## Compute the distances on the first three principal components
> hc2=hclust(d2,method = "ward.D")
> hc2
Call:
hclust(d = d2, method = "ward.D")
Cluster method : ward.D
Distance : euclidean
Number of objects: 50
> plot(1:50,hc2$height[50:1],type="b")
We then choose 3 clusters.
> clusMember = cutree(hc2, 3) ## Cut the tree into 3 clusters
> X$cluster=as.factor(clusMember)
> res.pca=PCA(X,scale.unit = T,ncp = 5,quali.sup = 21,graph=F)
> fviz_pca_ind(res.pca, geom = "text") + theme_classic()
> fviz_pca_ind(res.pca, geom = "text", habillage = X$cluster, addEllipses = TRUE,
+ ellipse.level = 0.95) + theme_classic()
> library(knitr)
> library(car)
> library(rgl)
> knit_hooks$set(webgl = hook_webgl)
> Z = cbind.data.frame(res.pca$ind$coord[, 1:3], X$cluster)
> colnames(Z) = c("pc1", "pc2", "pc3", "Cluster")
> next3d()
> scatter3d(x = Z$pc1, y = Z$pc2, z = Z$pc3, groups = Z$Cluster, surface = FALSE,
+ grid = FALSE, xlab = "pc1", ylab = "pc2", zlab = "pc3")You must enable Javascript to view this page properly.
> next3d()
> scatter3d(x = Z$pc1, y = Z$pc2, z = Z$pc3, groups = Z$Cluster, xlab = "pc1",
+ ylab = "pc2", zlab = "pc3", surface = FALSE, grid = FALSE, ellipsoid = TRUE)You must enable Javascript to view this page properly.
> library(cluster)
Attaching package: 'cluster'
The following object is masked from 'package:maps':
votes.repub
> library(FactoMineR)
> data(decathlon)
> dv = diana(decathlon[1:10, 1:10], metric = "euclidean", stand = TRUE) ## run the DIANA algorithm
> ag = hclust(dist(scale(decathlon[, 1:10])), "ward.D")
> dv$diss
Dissimilarities :
SEBRLE CLAY KARPOV BERNARD YURKOV WARNERS ZSIVOCZKY
CLAY 5.074947
KARPOV 4.937655 4.186841
BERNARD 4.163342 5.918668 6.234235
YURKOV 4.500498 7.139420 6.414837 5.856256
WARNERS 4.271824 5.811216 4.622854 4.525376 6.667144
ZSIVOCZKY 6.347546 6.075170 5.762872 6.341935 6.783817 4.484380
McMULLEN 5.076873 5.689521 5.396639 6.334272 5.989075 4.879141 3.453035
MARTINEAU 7.168729 7.804307 6.712978 6.431324 4.830541 6.745166 6.307940
HERNU 4.379304 5.622893 6.256807 5.270116 5.111039 4.924285 6.097268
McMULLEN MARTINEAU
CLAY
KARPOV
BERNARD
YURKOV
WARNERS
ZSIVOCZKY
McMULLEN
MARTINEAU 7.224287
HERNU 5.867388 5.669173
Metric : euclidean
Number of objects : 10
> plot(dv, which = 2, hang = -1)
> rownames(decathlon[1:10, ])
[1] "SEBRLE" "CLAY" "KARPOV" "BERNARD" "YURKOV"
[6] "WARNERS" "ZSIVOCZKY" "McMULLEN" "MARTINEAU" "HERNU"
> dv$order
[1] 1 4 6 2 3 7 8 5 9 10
> dv$order.lab
[1] "SEBRLE" "BERNARD" "WARNERS" "CLAY" "KARPOV"
[6] "ZSIVOCZKY" "McMULLEN" "YURKOV" "MARTINEAU" "HERNU"
> dv$merge
[,1] [,2]
[1,] -7 -8
[2,] -1 -4
[3,] -2 -3
[4,] 2 -6
[5,] -5 -9
[6,] 5 -10
[7,] 3 1
[8,] 4 7
[9,] 8 6
Data: let \(w_1,\ldots,w_n\) be \(n\) individuals for which \(d\) quantitative variables \(x^1,\ldots,x^d\) have been observed. Each \(w_i\) is therefore characterized by a vector of observations \(x_i'=(x_i^1,\ldots,x_i^d)\in \mathbb{R}^d\).
Objective: given a criterion function \(J\), partition \(w_1,\ldots,w_n\) into \(K\) classes (\(K\) fixed in advance) that minimize or maximize \(J\).
Exhaustive method: compute the criterion \(J(C)\) for every possible partition \(C\); the solution is then \[C_{sol}=\mbox{argmax}\, J(C).\]
This is infeasible as soon as \(n\) becomes large. Indeed, the number of possible partitions equals \[P(n,K)=\displaystyle\frac{1}{K!}\displaystyle\sum_{m=1}^{K} (-1)^{K-m}\, C_K^m\, m^{n}.\] For \(n=30\) and \(K=3\), \(P(n,K)\) is of the order of \(3\times 10^{13}.\)
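A quick numerical check of this count in R (a small sketch added for illustration, not part of the original transcript):

n <- 30; K <- 3
sum((-1)^(K - (1:K)) * choose(K, 1:K) * (1:K)^n) / factorial(K)   # about 3.4e13 partitions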
Criterion to minimize: the sum of squared errors (SSE) \[J(C)=\displaystyle\sum_{c\in C}\displaystyle\sum_{w\in c} d(w,m(c))^2\] where \(m(c)\) is the centre of gravity of \(c\).
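Before using kmeans() below, here is a minimal sketch of Lloyd's iteration, which alternates an assignment step and a centroid-update step to decrease \(J(C)\) (illustrative only: the function name lloyd and its details are not from the original notes, and empty clusters are not handled):

lloyd <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), K), , drop = FALSE]       # random initial centroids
  for (it in 1:max_iter) {
    ## assignment step: each point joins the cluster of its nearest centroid
    D  <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    cl <- apply(D, 1, which.min)
    ## update step: each centroid becomes the mean of its cluster
    new_centers <- t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-10)) break   # centroids have stabilised
    centers <- new_centers
  }
  list(cluster = cl, centers = centers, SSE = sum((X - centers[cl, , drop = FALSE])^2))
}
## e.g. lloyd(scale(nba[, -1]), K = 5), to be compared with kmeans() below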
> set.seed(8957)
> X = nba[, -1]
> rownames(X) = nba$Name
> X = scale(X) ## Normalize the variables of the nba data set
> cl1 = kmeans(X, centers = 5) ## We choose 5 clusters
> cl1
K-means clustering with 5 clusters of sizes 4, 14, 8, 13, 11
Cluster means:
G MIN PTS FGM FGA FGP
1 0.69365672 0.6335988 2.1072476 1.8696545 1.6222704 0.4284071
2 0.52056708 -0.2052974 -0.6632702 -0.7187516 -0.4507769 -0.5765413
3 -1.34149877 -0.7369595 -0.5489468 -0.3404942 -0.2144036 -0.3048749
4 0.08201786 0.6786035 0.5630399 0.2133054 0.4747096 -0.4188578
5 -0.03607345 -0.2351273 -0.1882864 0.2304443 -0.4212910 1.2947365
FTM FTA FTP X3PM X3PA X3PP
1 1.310459757 1.2540406 0.01517921 0.04200171 0.2012151 0.02738974
2 -0.768258643 -0.8495410 0.55161466 0.98061145 0.9198406 0.31742747
3 -0.345825404 -0.3523308 -0.03541816 -0.40801665 -0.3065460 -0.16690620
4 0.642075652 0.5208762 0.15835311 0.28893487 0.3047657 0.14537629
5 -0.006054388 0.2658788 -0.86896066 -1.30805339 -1.3811104 -0.46438051
ORB DRB TRB AST STL BLK
1 -0.2721355 0.13320839 -0.01483925 1.7202964 2.20094234 0.2612288
2 -0.5584353 -0.67313782 -0.66031960 -0.2012975 -0.16401580 -0.5121459
3 -0.7204208 -0.79106595 -0.79485129 0.3366192 -0.06193128 -0.6410417
4 -0.2396805 0.04420018 -0.05654721 0.1132766 0.28620313 -0.2345242
5 1.6168953 1.33136556 1.49070505 -0.7480518 -0.88479441 1.3002070
TO PF
1 1.05965137 -0.38944972
2 -0.49328598 -0.03847701
3 -0.08221433 -0.36670149
4 0.46939464 -0.14446877
5 -0.25245613 0.62801662
Clustering vector:
Dwyane Wade LeBron James Kobe Bryant
1 1 1
Dirk Nowitzki Danny Granger Kevin Durant
4 4 4
Kevin Martin Al Jefferson Chris Paul
4 5 1
Carmelo Anthony Chris Bosh Brandon Roy
4 5 4
Antawn Jamison Tony Parker Amare Stoudemire
4 3 5
Joe Johnson Devin Harris Michael Redd
4 4 3
David West Zachary Randolph Caron Butler
5 5 4
Vince Carter Stephen Jackson Ben Gordon
2 4 2
Dwight Howard Paul Pierce Al Harrington
5 4 2
Jamal Crawford Yao Ming Richard Jefferson
2 5 2
Jason Terry Deron Williams Tim Duncan
2 3 5
Monta Ellis Rudy Gay Pau Gasol
3 2 5
Andre Iguodala Corey Maggette O.J. Mayo
4 3 2
John Salmons Richard Hamilton Ray Allen
2 3 2
LaMarcus Aldridge Josh Howard Maurice Williams
5 3 2
Shaquille O'neal Rashard Lewis Chauncey Billups
5 2 2
Allen Iverson Nate Robinson
3 2
Within cluster sum of squares by cluster:
[1] 34.47898 81.33149 94.85234 128.98078 179.00968
(between_SS / total_SS = 47.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
> X = data.frame(X)
> X$cluster = as.factor(cl1$cluster)
> res.pca = PCA(X, scale.unit = T, ncp = 5, quali.sup = 21, graph = F)
> library(knitr)
> library(car)
> library(rgl)
> knit_hooks$set(webgl = hook_webgl)
> Z = cbind.data.frame(res.pca$ind$coord[, 1:3], X$cluster)
> colnames(Z) = c("pc1", "pc2", "pc3", "Cluster")
> next3d()
> scatter3d(x = Z$pc1, y = Z$pc2, z = Z$pc3, groups = Z$Cluster, surface = FALSE,
+ grid = FALSE, xlab = "pc1", ylab = "pc2", zlab = "pc3", ellipsoid = TRUE)You must enable Javascript to view this page properly.
We use NbClust to determine the optimal number of clusters. This algorithm aggregates about thirty indices measuring the quality of a classification.
> library(NbClust)
> set.seed(1234)
> nc <- NbClust(X[, -21], min.nc = 2, max.nc = 15, method = "kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 8 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 3 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
> table(nc$Best.n[1, ])
0 2 3 4 5 7 10 12 14 15
2 5 8 2 1 1 1 2 1 3
We used 26 criteria to assess the quality of the classifications.
> sum(table(nc$Best.n[1, ]))
[1] 26
> barplot(table(nc$Best.n[1, ]), xlab = "Number of Clusters", ylab = "Number of Criteria",
+ main = "Number of Clusters Chosen by 26 Criteria")
> set.seed(8957)
> X = nba[, -1]
> rownames(X) = nba$Name
> X = scale(X) ## Normalize the variables of the nba data set
> cl1 = kmeans(X, centers = 3) ## We choose 3 clusters
>
> X = data.frame(X)
> X$cluster = as.factor(cl1$cluster)
> res.pca = PCA(X, scale.unit = T, ncp = 5, quali.sup = 21, graph = F)
>
> knit_hooks$set(webgl = hook_webgl)
> Z = cbind.data.frame(res.pca$ind$coord[, 1:3], X$cluster)
> colnames(Z) = c("pc1", "pc2", "pc3", "Cluster")
> next3d()
> scatter3d(x = Z$pc1, y = Z$pc2, z = Z$pc3, groups = Z$Cluster, surface = FALSE,
+ grid = FALSE, xlab = "pc1", ylab = "pc2", zlab = "pc3", ellipsoid = TRUE)You must enable Javascript to view this page properly.
Let \(\underline{X}=(X_1,\ldots,X_n)\sim g(x\mid \theta)\) be an \(n\)-sample of random vectors in \(\mathbb{R}^d\). We assume \[g(x\mid \theta) = \int f(x,z\mid \theta)\, dz.\]
Objective: compute \(\widehat {\theta}=\mbox{argmax}_\theta L(\underline{x}\mid \theta)=\mbox{argmax}_\theta \displaystyle\prod_{i=1}^n g(x_i\mid \theta)\) where \(\underline{x}=(x_1,\ldots,x_n)\) is a realization of \(\underline{X}\).
Let \((X,Z)\sim f(x,z\mid \theta)\); the density of \(Z\mid \theta, X=x\) is \[k(z\mid \theta,x)=\displaystyle\frac{f(x,z\mid \theta)}{g(x\mid \theta)}.\]
Hence \[ \log g(x\mid \theta)= \log f(x,z\mid \theta) - \log k(z\mid \theta,x).\]
For a fixed \(\theta_0\), \[ \begin{array}{rcl} \log g(x\mid \theta)\, k(z\mid \theta_0,x)&=& \log f(x,z\mid \theta)\, k(z\mid \theta_0,x) \\ & & - \log k(z\mid \theta,x)\, k(z\mid \theta_0,x). \end{array}\]
Integrating the previous equation with respect to \(z\), we obtain \[\log g(x\mid \theta) = \mathbb{E}_{\theta_0}(\log f(x,Z\mid \theta))- \mathbb{E}_{\theta_0}(\log k(Z\mid x, \theta)).\]
The likelihood can then be written \[\begin{array}{rcl} \log L(\underline{x}\mid \theta)&=& \displaystyle\sum_{i=1}^n \log g(x_i\mid \theta) \\ & & \\ &=& \underbrace{\mathbb{E}_{\theta_0}(\log L^c(\underline{x},Z\mid \theta))}_{Q(\theta\mid \underline{x},\theta_0)}- \displaystyle\sum_{i=1}^n \mathbb{E}_{\theta_0}(\log k(Z\mid x_i, \theta)) \end{array}\] where \(L^c\) is called the complete-data likelihood.
Thus, to search for the maximum of \(\log L(\underline{x}\mid \theta)\), the EM algorithm iteratively maximizes \(Q(\theta\mid \underline{x},\theta_0)\).
1. Start from an initial value \(\widehat{\theta}_{(0)}\) and set \(m=0\).
2. E step: compute the expectation \[Q(\theta\mid \underline{x},\widehat{\theta}_{(m)}) = \mathbb{E}_{\widehat{\theta}_{(m)}}(\log L^c(\underline{x},Z\mid \theta))\] where the expectation is taken with respect to \(k(z\mid \widehat{\theta}_{(m)},x)\).
3. M step: maximize \(Q(\theta\mid \underline{x},\widehat{\theta}_{(m)})\) over \(\theta\): \[\widehat{\theta}_{(m+1)}=\mbox{argmax}_\theta Q(\theta\mid \underline{x},\widehat{\theta}_{(m)})\] and set \(m=m+1\).
4. Repeat steps 2-3 until a fixed point is reached: \(\widehat{\theta}_{(m+1)}=\widehat{\theta}_{(m)}\).
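To make the two steps concrete, here is a minimal sketch of the EM iteration for a two-component univariate Gaussian mixture (the function em_gauss2 and its initialisation are illustrative choices, not taken from the course; the mclust package used below implements the general case):

em_gauss2 <- function(x, max_iter = 200, tol = 1e-8) {
  pi1 <- 0.5                                        # initial mixing proportion
  mu  <- quantile(x, c(0.25, 0.75), names = FALSE)  # rough initial means
  s2  <- rep(var(x), 2)                             # initial variances
  ll_old <- -Inf
  for (m in 1:max_iter) {
    ## E step: posterior probabilities k(z | theta_(m), x_i)
    d1  <- pi1 * dnorm(x, mu[1], sqrt(s2[1]))
    d2  <- (1 - pi1) * dnorm(x, mu[2], sqrt(s2[2]))
    tau <- d1 / (d1 + d2)
    ## M step: maximise Q(theta | x, theta_(m))
    pi1   <- mean(tau)
    mu[1] <- sum(tau * x) / sum(tau)
    mu[2] <- sum((1 - tau) * x) / sum(1 - tau)
    s2[1] <- sum(tau * (x - mu[1])^2) / sum(tau)
    s2[2] <- sum((1 - tau) * (x - mu[2])^2) / sum(1 - tau)
    ll <- sum(log(d1 + d2))                         # log-likelihood at theta_(m)
    if (abs(ll - ll_old) < tol) break               # (numerical) fixed point reached
    ll_old <- ll
  }
  list(pro = c(pi1, 1 - pi1), mean = mu, sigmasq = s2, loglik = ll)
}
## e.g. em_gauss2(faithful$waiting), to be compared with Mclust(faithful$waiting, G = 2)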
In R: the mclust package (see its documentation for a detailed description).
We use the Mclust command.
Mclust runs the EM algorithm (with different parametrizations of the covariance matrices) and uses the BIC as model-selection criterion: it looks for the model \(M_k\) with the largest BIC. \[\mbox{BIC}=2 \log \mbox{Likelihood} - \mbox{Number of parameters} \times \log(n)\]
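For a fitted Mclust object this relation can be checked directly (a small sketch; the object name fit is only a placeholder):

library(mclust)
fit <- Mclust(faithful)
2 * fit$loglik - fit$df * log(fit$n)   # reproduces fit$bic (mclust maximizes this quantity)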
The models share the following parameters: \[\alpha_1, \;\alpha_2, \ldots, \, \alpha_k=1-\alpha_1-\ldots-\alpha_{k-1},\quad \mu_1, \ldots, \mu_k.\]
If \(d=1\), the component variances are either equal across components (model E) or component-specific (model V).
If \(d\geq 2\), the covariance matrices are parametrized through the decomposition \(\Sigma_j=\lambda_j D_jA_j(D_j)'\), where \(\lambda_j\) controls the volume, \(D_j\) the orientation and \(A_j\) the shape of component \(j\); the model names (EII, VII, EEI, ..., VVV) indicate whether each of these elements is Equal or Variable across components, or the Identity. VVV is the most general model.
These data record the waiting time between two eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
> library(mclust)
Package 'mclust' version 5.2
Type 'citation("mclust")' for citing this R package in publications.
Attaching package: 'mclust'
The following object is masked from 'package:maps':
map
> data(faithful)
> plot(faithful, col = 2, lwd = 2, pch = 3)
> faithfulMclust <- Mclust(faithful)
> faithfulMclust
'Mclust' model object:
best model: ellipsoidal, equal volume, shape and orientation (EEE) with 3 components
> names(faithfulMclust)
[1] "call" "data" "modelName" "n"
[5] "d" "G" "BIC" "bic"
[9] "loglik" "df" "hypvol" "parameters"
[13] "z" "classification" "uncertainty"
> faithfulMclust$modelName
[1] "EEE"
> faithfulMclust$d ### Dimension of the data (number of variables)
[1] 2
>
> faithfulMclust$G ### Optimal number of groups
[1] 3
>
> faithfulMclust$BIC ### BIC values of the different models considered
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE
1 -4024.721 -4024.721 -3055.835 -3055.835 -3055.835 -3055.835 -2607.623
2 -3452.998 -3458.300 -2354.601 -2350.607 -2352.618 -2346.065 -2325.220
3 -3377.712 -3336.542 -2323.008 -2332.698 -2332.204 -2342.371 -2314.386
4 -3230.246 -3245.732 -2323.676 -2331.829 -2334.756 -2343.068 -2320.207
5 -3149.395 -3128.214 -2337.696 -2348.300 -2355.891 -2374.307 -2336.975
6 -3081.411 -3067.559 -2338.118 -2363.112 -2357.725 -2372.748 -2347.297
7 -2990.335 -2998.497 -2356.461 -2370.061 -2375.851 -2393.101 -2361.206
8 -2978.090 -2991.851 -2371.809 NA -2395.989 NA -2376.917
9 -2899.779 -2920.951 -2388.629 NA -2399.083 NA -2393.728
EVE VEE VVE EEV VEV EVV VVV
1 -2607.623 -2607.623 -2607.623 -2607.623 -2607.623 -2607.623 -2607.623
2 -2324.273 -2322.972 -2320.433 -2329.116 -2325.416 -2327.598 -2322.192
3 -2322.690 -2322.094 -2332.433 -2338.986 -2329.352 -2335.618 -2333.894
4 -2340.044 -2332.885 -2354.890 -2336.750 -2342.472 -2344.720 -2359.216
5 -2354.462 -2352.661 -2376.045 -2356.225 -2366.193 -2374.132 -2385.288
6 -2368.534 -2364.464 -2392.133 -2371.732 -2387.445 -2387.218 -2398.974
7 -2381.415 -2375.217 -2398.950 -2392.963 -2384.183 -2413.721 -2426.488
8 -2401.464 NA NA -2385.815 -2404.950 -2434.531 -2434.990
9 -2404.102 NA NA -2418.305 -2428.351 -2431.113 -2447.286
Top 3 models based on the BIC criterion:
EEE,3 EEE,4 VVE,2
-2314.386 -2320.207 -2320.433
>
> faithfulMclust$bic ### the optimal BIC
[1] -2314.386
>
> faithfulMclust$loglik ### log-likelihood of the optimal model
[1] -1126.361
> nParam = (1/log(nrow(faithful))) * (2 * faithfulMclust$loglik - faithfulMclust$bic)
> nParam
[1] 11
> faithfulMclust$parameters ### Estimated parameters of the model
$pro
[1] 0.4632682 0.3564512 0.1802806
$mean
[,1] [,2] [,3]
eruptions 4.475059 2.037798 3.817687
waiting 80.890383 54.493272 77.650757
$variance
$variance$modelName
[1] "EEE"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
eruptions waiting
eruptions 0.07734049 0.4757779
waiting 0.47577787 33.7403885
, , 2
eruptions waiting
eruptions 0.07734049 0.4757779
waiting 0.47577787 33.7403885
, , 3
eruptions waiting
eruptions 0.07734049 0.4757779
waiting 0.47577787 33.7403885
$variance$Sigma
eruptions waiting
eruptions 0.07734049 0.4757779
waiting 0.47577787 33.7403885
$variance$cholSigma
eruptions waiting
eruptions 0.2781016 1.710806
waiting 0.0000000 5.550994
> matplot(faithfulMclust$BIC, pch = 1:ncol(faithfulMclust$BIC), col = 1:ncol(faithfulMclust$BIC),
+ lwd = 2, type = "b", axes = F, xlab = "Nombre de classes", ylab = "BIC")
> legend("bottomright", colnames(faithfulMclust$BIC), text.col = 1:ncol(faithfulMclust$BIC))
> axis(1, 1:nrow(faithfulMclust$BIC), 1:nrow(faithfulMclust$BIC))
> axis(2)
> abline(v = faithfulMclust$G, lwd = 3, col = "gray")
> abline(h = faithfulMclust$bic, lwd = 3, col = "gray")
> points(faithfulMclust$G, faithfulMclust$bic, lwd = 7, col = "gray")
> head(faithfulMclust$z)
[,1] [,2] [,3]
1 2.181744e-02 1.130837e-08 9.781825e-01
2 2.475031e-21 1.000000e+00 3.320864e-13
3 2.521625e-03 2.051823e-05 9.974579e-01
4 6.553336e-14 9.999998e-01 1.664978e-07
5 9.838967e-01 7.642900e-20 1.610327e-02
6 2.104355e-07 9.975388e-01 2.461029e-03
>
> map(faithfulMclust$z) ### assign each individual to a class
[1] 3 2 3 2 1 2 1 3 2 1 2 3 1 2 1 2 2 1 2 1 2 2 3 3 1 3 2 1 3 1 1 1 3 3 3
[36] 2 2 1 2 1 1 2 1 2 1 3 3 2 1 2 1 1 2 1 2 1 3 2 1 1 2 1 2 1 2 1 1 1 2 1
[71] 3 2 1 3 2 1 2 1 3 3 1 1 1 2 1 1 3 1 2 3 2 1 2 1 2 1 1 3 2 1 2 1 2 1 1
[106] 2 1 2 1 3 1 2 1 1 2 1 2 1 2 1 2 1 1 2 1 3 2 1 2 1 2 1 2 1 2 1 2 1 2 3
[141] 1 2 1 1 1 2 1 2 1 2 1 3 2 1 3 3 1 1 2 3 2 1 2 3 3 1 2 1 2 1 2 2 1 3 1
[176] 1 1 2 3 1 2 1 1 3 2 1 1 2 1 2 1 2 1 1 3 1 3 1 2 1 2 1 1 2 1 2 1 3 2 1
[211] 2 1 2 3 3 1 2 1 2 1 2 1 2 1 3 1 1 1 3 1 1 2 1 2 1 2 2 1 3 2 1 2 1 2 1
[246] 3 2 1 2 1 2 1 3 1 1 3 3 1 2 1 1 1 2 1 2 2 1 1 2 1 2 1
> plot(faithfulMclust, data = faithful, what = "BIC")
> plot(faithfulMclust, what = "classification")
> plot(faithfulMclust, what = "density")
> plot(faithfulMclust, what = "uncertainty")
> uncer = 1 - apply(faithfulMclust$z, 1, max) ### these are the 'uncertainty' values...
> uncerPlot(z = faithfulMclust$z)
> faithfulMeii = Mclust(faithful, modelName = "EII", G = 3) #1
> faithfulMvii = Mclust(faithful, modelName = "VII", G = 3) #2
> faithfulMeei = Mclust(faithful, modelName = "EEI", G = 3) #3
> faithfulMvei = Mclust(faithful, modelName = "VEI", G = 3) #4
> faithfulMevi = Mclust(faithful, modelName = "EVI", G = 3) #5
> faithfulMvvi = Mclust(faithful, modelName = "VVI", G = 3) #6
> faithfulMeee = Mclust(faithful, modelName = "EEE", G = 3) #7
> faithfulMeev = Mclust(faithful, modelName = "EEV", G = 3) #8
> faithfulMvev = Mclust(faithful, modelName = "VEV", G = 3) #9
> faithfulMvvv = Mclust(faithful, modelName = "VVV", G = 3) #10
> mclust2Dplot(faithful, parameters = faithfulMclust$parameters, z = faithfulMclust$z,
+ what = "classification", identify = TRUE, CEX = 0.8)
> library(MASS)
> ## Construct an orthogonal (rotation) matrix
> matOrth = function(theta) matrix(c(sin(theta), -cos(theta), cos(theta), sin(theta)),
+ 2, 2)
>
> constParam = function(A, theta = rep(pi/2, 3), diffmean = 2, lbd = rep(2, 3),
+ mu = c(0, 0)) {
+ mu1 = mu
+ mu2 = c(mu[1] + diffmean, mu[2])
+ mu3 = c(mu[1] + diffmean/2, mu[2] + diffmean/2)
+
+ U1 = matOrth(theta[1])
+ U2 = matOrth(theta[2])
+ U3 = matOrth(theta[3])
+
+ A1 = A$A1
+ A2 = A$A2
+ A3 = A$A3
+
+ Sig1 = lbd[1] * U1 %*% A1 %*% t(U1)
+ Sig2 = lbd[2] * U2 %*% A2 %*% t(U2)
+ Sig3 = lbd[3] * U3 %*% A3 %*% t(U3)
+
+ moyenne = cbind(mu1, mu2, mu3)
+ Sigma = list(sig1 = Sig1, sig2 = Sig2, sig3 = Sig3)
+ parametres = list(m = moyenne, S = Sigma)
+ return(parametres)
+ }
>
>
> ####
>
> simModels = function(params, modelName) {
+ A = params$A
+ theta = params$theta
+ diffmean = params$diffmean
+ lbd = params$lbd
+ mu = params$mu
+ pMod = constParam(A, theta, diffmean, lbd, mu)
+ print(pMod)
+ dMod = rbind(mvrnorm(50, pMod$m[, 1], pMod$S$sig1), mvrnorm(50, pMod$m[,
+ 2], pMod$S$sig2), mvrnorm(50, pMod$m[, 3], pMod$S$sig3))
+ dMod = dMod[sample(1:150), ]
+
+ mMod = Mclust(dMod, modelName, G = 3)
+ print(mMod$parameters)
+ mclust2Dplot(dMod, parameters = mMod$parameters, z = mMod$z, , identify = TRUE,
+ scale = T)
+ }
>
>
> ### EII
> params = list(A = list(A1 = diag(rep(1, 2)), A2 = diag(rep(1, 2)), A3 = diag(rep(1,
+ 2))), theta = rep(pi/2, 3), diffmean = 2, lbd = rep(2, 3), mu = c(0, 0))
> simModels(params, "EII")
$m
mu1 mu2 mu3
[1,] 0 2 1
[2,] 0 0 1
$S
$S$sig1
[,1] [,2]
[1,] 2 0
[2,] 0 2
$S$sig2
[,1] [,2]
[1,] 2 0
[2,] 0 2
$S$sig3
[,1] [,2]
[1,] 2 0
[2,] 0 2
$pro
[1] 0.2876024 0.6119962 0.1004013
$mean
[,1] [,2] [,3]
[1,] -0.6511209 1.1911670 3.3613975
[2,] -0.4552119 0.9097453 -0.6399351
$variance
$variance$modelName
[1] "EII"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 1.681594 0.000000
[2,] 0.000000 1.681594
, , 2
[,1] [,2]
[1,] 1.681594 0.000000
[2,] 0.000000 1.681594
, , 3
[,1] [,2]
[1,] 1.681594 0.000000
[2,] 0.000000 1.681594
$variance$Sigma
[,1] [,2]
[1,] 1.681594 0.000000
[2,] 0.000000 1.681594
$variance$sigmasq
[1] 1.681594
$variance$scale
[1] 1.681594
>
>
> ### VII
>
> params = list(A = list(A1 = diag(rep(1, 2)), A2 = diag(rep(1, 2)), A3 = diag(rep(1,
+ 2))), theta = rep(pi/2, 3), diffmean = 2, lbd = c(4, 2, 0.5), mu = c(0,
+ 0))
> simModels(params, "VII")
$m
mu1 mu2 mu3
[1,] 0 2 1
[2,] 0 0 1
$S
$S$sig1
[,1] [,2]
[1,] 4 0
[2,] 0 4
$S$sig2
[,1] [,2]
[1,] 2 0
[2,] 0 2
$S$sig3
[,1] [,2]
[1,] 0.5 0.0
[2,] 0.0 0.5
$pro
[1] 0.2628255 0.3638437 0.3733307
$mean
[,1] [,2] [,3]
[1,] 1.091162 2.42634010 0.3296532
[2,] 1.334313 0.06589753 -0.2302634
$variance
$variance$modelName
[1] "VII"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 0.3029242 0.0000000
[2,] 0.0000000 0.3029242
, , 2
[,1] [,2]
[1,] 1.331435 0.000000
[2,] 0.000000 1.331435
, , 3
[,1] [,2]
[1,] 4.607822 0.000000
[2,] 0.000000 4.607822
$variance$sigmasq
[1] 0.3029242 1.3314353 4.6078215
$variance$scale
[1] 0.3029242 1.3314353 4.6078215
>
>
> ### EEI
>
> params = list(A = list(A1 = diag(c(4, 0.5)), A2 = diag(c(4, 0.5)), A3 = diag(c(4,
+ 0.5))), theta = rep(pi/2, 3), diffmean = 2, lbd = c(1, 1, 1), mu = c(0,
+ 0))
> simModels(params, "EEI")
$m
mu1 mu2 mu3
[1,] 0 2 1
[2,] 0 0 1
$S
$S$sig1
[,1] [,2]
[1,] 4.000000e+00 -2.143132e-16
[2,] -2.143132e-16 5.000000e-01
$S$sig2
[,1] [,2]
[1,] 4.000000e+00 -2.143132e-16
[2,] -2.143132e-16 5.000000e-01
$S$sig3
[,1] [,2]
[1,] 4.000000e+00 -2.143132e-16
[2,] -2.143132e-16 5.000000e-01
$pro
[1] 0.4873883 0.3616058 0.1510059
$mean
[,1] [,2] [,3]
[1,] 0.2514949 3.2162569 -1.4789411
[2,] 0.3720136 0.2448591 0.1831171
$variance
$variance$modelName
[1] "EEI"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 1.944633 0.000000
[2,] 0.000000 0.547669
, , 2
[,1] [,2]
[1,] 1.944633 0.000000
[2,] 0.000000 0.547669
, , 3
[,1] [,2]
[1,] 1.944633 0.000000
[2,] 0.000000 0.547669
$variance$Sigma
[,1] [,2]
[1,] 1.944633 0.000000
[2,] 0.000000 0.547669
$variance$scale
[1] 1.031996
$variance$shape
[1] 1.8843423 0.5306891
>
>
> ### VEI
>
> params = list(A = list(A1 = diag(c(2, 0.5)), A2 = diag(c(2, 0.5)), A3 = diag(c(2,
+ 0.5))), theta = rep(pi/2, 3), diffmean = 2, lbd = c(4, 2, 0.5), mu = c(0,
+ 0))
> simModels(params, "VEI")
$m
mu1 mu2 mu3
[1,] 0 2 1
[2,] 0 0 1
$S
$S$sig1
[,1] [,2]
[1,] 8.00000e+00 -3.67394e-16
[2,] -3.67394e-16 2.00000e+00
$S$sig2
[,1] [,2]
[1,] 4.00000e+00 -1.83697e-16
[2,] -1.83697e-16 1.00000e+00
$S$sig3
[,1] [,2]
[1,] 1.000000e+00 -4.592425e-17
[2,] -4.592425e-17 2.500000e-01
$pro
[1] 0.3752047 0.4456503 0.1791450
$mean
[,1] [,2] [,3]
[1,] 0.8389089 1.955098324 -1.168507336
[2,] 0.8544380 -0.005212084 -0.007670071
$variance
$variance$modelName
[1] "VEI"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 1.232728 0.0000000
[2,] 0.000000 0.3288635
, , 2
[,1] [,2]
[1,] 4.033424 0.000000
[2,] 0.000000 1.076025
, , 3
[,1] [,2]
[1,] 10.42907 0.000000
[2,] 0.00000 2.782236
$variance$scale
[1] 0.6367098 2.0832818 5.3866623
$variance$shape
[1] 1.9360913 0.5165046
>
>
>
> ### EVI
>
> params = list(A = list(A1 = diag(c(2, 0.5)), A2 = diag(c(6, 1)), A3 = diag(c(4,
+ 1))), theta = rep(pi/2, 3), diffmean = 4, lbd = c(2, 2, 2), mu = c(0, 0))
> simModels(params, "EVI")
$m
mu1 mu2 mu3
[1,] 0 4 2
[2,] 0 0 2
$S
$S$sig1
[,1] [,2]
[1,] 4.00000e+00 -1.83697e-16
[2,] -1.83697e-16 1.00000e+00
$S$sig2
[,1] [,2]
[1,] 1.200000e+01 -6.123234e-16
[2,] -6.123234e-16 2.000000e+00
$S$sig3
[,1] [,2]
[1,] 8.00000e+00 -3.67394e-16
[2,] -3.67394e-16 2.00000e+00
$pro
[1] 0.5463040 0.2328944 0.2208016
$mean
[,1] [,2] [,3]
[1,] 0.6015378 5.34189432 2.757756
[2,] 0.1940628 0.04327091 2.693749
$variance
$variance$modelName
[1] "EVI"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 3.346932 0.000000
[2,] 0.000000 1.446946
, , 2
[,1] [,2]
[1,] 4.314003 0.000000
[2,] 0.000000 1.122584
, , 3
[,1] [,2]
[1,] 7.298548 0.0000000
[2,] 0.000000 0.6635334
$variance$scale
[1] 2.200643
$variance$shape
1 2 3
[1,] 1.5208881 1.9603376 3.3165523
[2,] 0.6575106 0.5101162 0.3015179
>
> #### VVI
>
> params = list(A = list(A1 = diag(c(2, 0.5)), A2 = diag(c(6, 1)), A3 = diag(c(4,
+ 1))), theta = rep(pi/2, 3), diffmean = 4, lbd = c(4, 2, 0.5), mu = c(0,
+ 0))
> simModels(params, "VVI")
$m
mu1 mu2 mu3
[1,] 0 4 2
[2,] 0 0 2
$S
$S$sig1
[,1] [,2]
[1,] 8.00000e+00 -3.67394e-16
[2,] -3.67394e-16 2.00000e+00
$S$sig2
[,1] [,2]
[1,] 1.200000e+01 -6.123234e-16
[2,] -6.123234e-16 2.000000e+00
$S$sig3
[,1] [,2]
[1,] 2.000000e+00 -9.184851e-17
[2,] -9.184851e-17 5.000000e-01
$pro
[1] 0.1238564 0.5581594 0.3179841
$mean
[,1] [,2] [,3]
[1,] 4.9841263 0.8807535 2.585298
[2,] -0.2796733 0.2512623 1.832758
$variance
$variance$modelName
[1] "VVI"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 11.4511 0.0000000
[2,] 0.0000 0.8643044
, , 2
[,1] [,2]
[1,] 7.584635 0.000000
[2,] 0.000000 2.231134
, , 3
[,1] [,2]
[1,] 1.810672 0.0000000
[2,] 0.000000 0.5717437
$variance$sigmasq
[1] 3.145988 4.113677 1.017468
$variance$scale
[1] 3.145988 4.113677 1.017468
$variance$shape
1 2 3
[1,] 3.6399073 1.8437606 1.7795870
[2,] 0.2747323 0.5423698 0.5619281
>
> #### EEE
>
> params = list(A = list(A1 = diag(c(2, 0.5)), A2 = diag(c(2, 0.5)), A3 = diag(c(2,
+ 0.5))), theta = c(pi/4, pi/4, pi/4), diffmean = 4, lbd = c(1, 1, 1), mu = c(0,
+ 0))
> simModels(params, "EEE")
$m
mu1 mu2 mu3
[1,] 0 4 2
[2,] 0 0 2
$S
$S$sig1
[,1] [,2]
[1,] 1.25 -0.75
[2,] -0.75 1.25
$S$sig2
[,1] [,2]
[1,] 1.25 -0.75
[2,] -0.75 1.25
$S$sig3
[,1] [,2]
[1,] 1.25 -0.75
[2,] -0.75 1.25
$pro
[1] 0.3395002 0.3653070 0.2951929
$mean
[,1] [,2] [,3]
[1,] 0.024538363 1.894953 4.1150051
[2,] 0.001072659 1.940721 0.2177404
$variance
$variance$modelName
[1] "EEE"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 0.9110740 -0.6419575
[2,] -0.6419575 1.2025268
, , 2
[,1] [,2]
[1,] 0.9110740 -0.6419575
[2,] -0.6419575 1.2025268
, , 3
[,1] [,2]
[1,] 0.9110740 -0.6419575
[2,] -0.6419575 1.2025268
$variance$Sigma
[,1] [,2]
[1,] 0.9110740 -0.6419575
[2,] -0.6419575 1.2025268
$variance$cholSigma
[,1] [,2]
[1,] 0.9545019 -0.6725575
[2,] 0.0000000 0.8661369
>
> ### EEV
>
>
> params = list(A = list(A1 = diag(c(4, 1)), A2 = diag(c(4, 1)), A3 = diag(c(4,
+ 1))), theta = c(pi/4, pi/3, pi/6), diffmean = 4, lbd = c(2, 2, 2), mu = c(0,
+ 0))
> simModels(params, "EEV")
$m
mu1 mu2 mu3
[1,] 0 4 2
[2,] 0 0 2
$S
$S$sig1
[,1] [,2]
[1,] 5 -3
[2,] -3 5
$S$sig2
[,1] [,2]
[1,] 6.500000 -2.598076
[2,] -2.598076 3.500000
$S$sig3
[,1] [,2]
[1,] 3.500000 -2.598076
[2,] -2.598076 6.500000
$pro
[1] 0.2078873 0.7127473 0.0793654
$mean
[,1] [,2] [,3]
[1,] -0.2247723 2.3461985 5.298593
[2,] 2.3598563 0.1043547 -3.642747
$variance
$variance$modelName
[1] "EEV"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 2.4310189 0.6967914
[2,] 0.6967914 6.3399218
, , 2
[,1] [,2]
[1,] 6.047401 -1.24233
[2,] -1.242330 2.72354
, , 3
[,1] [,2]
[1,] 6.170952 1.057097
[2,] 1.057097 2.599989
$variance$scale
[1] 3.863541
$variance$shape
[1] 1.6721488 0.5980329
$variance$orientation
, , 1
[,1] [,2]
[1,] -0.1703980 0.9853753
[2,] -0.9853753 -0.1703980
, , 2
[,1] [,2]
[1,] 0.9489340 -0.3154746
[2,] -0.3154746 -0.9489340
, , 3
[,1] [,2]
[1,] -0.9644936 0.2641063
[2,] -0.2641063 -0.9644936
>
>
> ### VEV
>
> params = list(A = list(A1 = diag(c(4, 1)), A2 = diag(c(4, 1)), A3 = diag(c(4,
+ 1))), theta = c(pi/4, pi/3, pi/6), diffmean = 6, lbd = c(4, 2, 0.5), mu = c(0,
+ 0))
> simModels(params, "VEV")
$m
mu1 mu2 mu3
[1,] 0 6 3
[2,] 0 0 3
$S
$S$sig1
[,1] [,2]
[1,] 10 -6
[2,] -6 10
$S$sig2
[,1] [,2]
[1,] 6.500000 -2.598076
[2,] -2.598076 3.500000
$S$sig3
[,1] [,2]
[1,] 0.8750000 -0.6495191
[2,] -0.6495191 1.6250000
$pro
[1] 0.4035813 0.3376697 0.2587490
$mean
[,1] [,2] [,3]
[1,] 3.255515 -0.7353374 6.7437514
[2,] 2.716505 -0.2015558 -0.7565554
$variance
$variance$modelName
[1] "VEV"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 0.8609234 -0.4607194
[2,] -0.4607194 1.4183184
, , 2
[,1] [,2]
[1,] 6.965856 -3.295770
[2,] -3.295770 6.984943
, , 3
[,1] [,2]
[1,] 4.6124015 0.4501519
[2,] 0.4501519 1.7461426
$variance$scale
[1] 1.004391 6.147683 2.802013
$variance$shape
[1] 1.670741 0.598537
$variance$orientation
, , 1
[,1] [,2]
[1,] -0.4911280 0.8710874
[2,] 0.8710874 0.4911280
, , 2
[,1] [,2]
[1,] -0.7060823 -0.7081298
[2,] 0.7081298 -0.7060823
, , 3
[,1] [,2]
[1,] -0.9884440 0.1515862
[2,] -0.1515862 -0.9884440
>
>
> ### VVV
>
> params = list(A = list(A1 = diag(c(6, 3)), A2 = diag(c(4, 1)), A3 = diag(c(2,
+ 0.33))), theta = c(pi/4, pi/3, pi/6), diffmean = 6, lbd = c(4, 2, 0.5),
+ mu = c(0, 0))
> simModels(params, "VVV")
$m
mu1 mu2 mu3
[1,] 0 6 3
[2,] 0 0 3
$S
$S$sig1
[,1] [,2]
[1,] 18 -6
[2,] -6 18
$S$sig2
[,1] [,2]
[1,] 6.500000 -2.598076
[2,] -2.598076 3.500000
$S$sig3
[,1] [,2]
[1,] 0.3737500 -0.3615656
[2,] -0.3615656 0.7912500
$pro
[1] 0.3191618 0.3054795 0.3753588
$mean
[,1] [,2] [,3]
[1,] 2.945927 6.6982602 -1.3916764
[2,] 3.104651 -0.4310981 0.2662897
$variance
$variance$modelName
[1] "VVV"
$variance$d
[1] 2
$variance$G
[1] 3
$variance$sigma
, , 1
[,1] [,2]
[1,] 0.2352149 -0.2795526
[2,] -0.2795526 0.6847115
, , 2
[,1] [,2]
[1,] 6.765658 -2.751391
[2,] -2.751391 3.560216
, , 3
[,1] [,2]
[1,] 19.277990 -3.210869
[2,] -3.210869 16.354115
$variance$cholsigma
, , 1
[,1] [,2]
[1,] -0.4849896 0.5764095
[2,] 0.0000000 -0.5936864
, , 2
[,1] [,2]
[1,] 2.601088 -1.057785
[2,] 0.000000 -1.562468
, , 3
[,1] [,2]
[1,] -4.390671 0.7312936
[2,] 0.000000 -3.9773514
> data(precip)
> precipMclust <- Mclust(precip)
> precipBIC <- mclustBIC(precip)
> plot(precipBIC)
> precipMod = summary(precipBIC, precip)
> precipMod
Best BIC values:
V,2 E,1 V,1
BIC -572.1882 -572.6445308 -572.6445308
BIC diff 0.0000 -0.4563493 -0.4563493
Classification table for model (V,2):
1 2
13 57
> precipMod2 <- summary(precipBIC, precip, G = 2)
> precipMod2$parameters
$pro
[1] 0.1813667 0.8186333
$mean
1 2
12.79252 39.78042
$variance
$variance$modelName
[1] "V"
$variance$d
[1] 1
$variance$G
[1] 2
$variance$sigmasq
[1] 16.81275 90.39377
$variance$scale
[1] 16.81275 90.39377
> precipMod2$z
[,1] [,2]
[1,] 5.433686e-37 1.00000000
[2,] 4.809770e-23 1.00000000
[3,] 9.865710e-01 0.01342897
[4,] 3.265155e-17 1.00000000
[5,] 9.516228e-01 0.04837725
[6,] 8.308280e-01 0.16917198
[7,] 3.798033e-01 0.62019667
[8,] 9.648290e-01 0.03517099
[9,] 5.115308e-13 1.00000000
[10,] 1.157442e-10 1.00000000
[11,] 9.097950e-10 1.00000000
[12,] 7.634565e-23 1.00000000
[13,] 1.918114e-28 1.00000000
[14,] 4.882606e-17 1.00000000
[15,] 1.090297e-01 0.89097031
[16,] 9.763486e-01 0.02365144
[17,] 6.107383e-07 0.99999939
[18,] 2.365409e-07 0.99999976
[19,] 1.240423e-09 1.00000000
[20,] 5.529499e-05 0.99994471
[21,] 6.973533e-05 0.99993026
[22,] 8.683768e-13 1.00000000
[23,] 3.347562e-25 1.00000000
[24,] 4.348166e-11 1.00000000
[25,] 8.183393e-12 1.00000000
[26,] 2.470239e-12 1.00000000
[27,] 4.376043e-05 0.99995624
[28,] 1.900542e-05 0.99998099
[29,] 1.102743e-04 0.99988973
[30,] 9.268014e-03 0.99073199
[31,] 7.865084e-18 1.00000000
[32,] 1.600087e-08 0.99999998
[33,] 7.772652e-08 0.99999992
[34,] 9.307606e-01 0.06923938
[35,] 1.102743e-04 0.99988973
[36,] 9.865056e-01 0.01349443
[37,] 5.080072e-08 0.99999995
[38,] 1.115211e-14 1.00000000
[39,] 9.861505e-01 0.01384953
[40,] 2.272956e-06 0.99999773
[41,] 5.856596e-08 0.99999994
[42,] 1.157442e-10 1.00000000
[43,] 1.746752e-12 1.00000000
[44,] 2.470239e-12 1.00000000
[45,] 8.887534e-01 0.11124661
[46,] 7.786047e-10 1.00000000
[47,] 2.712583e-07 0.99999973
[48,] 1.600087e-08 0.99999998
[49,] 2.724976e-05 0.99997275
[50,] 6.592837e-09 0.99999999
[51,] 1.876181e-10 1.00000000
[52,] 5.080072e-08 0.99999995
[53,] 1.467790e-12 1.00000000
[54,] 2.028002e-15 1.00000000
[55,] 2.681963e-02 0.97318037
[56,] 9.652606e-18 1.00000000
[57,] 4.346855e-15 1.00000000
[58,] 7.772652e-08 0.99999992
[59,] 9.861505e-01 0.01384953
[60,] 5.966399e-17 1.00000000
[61,] 9.252916e-01 0.07470836
[62,] 7.118441e-06 0.99999288
[63,] 4.911128e-14 1.00000000
[64,] 2.077732e-12 1.00000000
[65,] 1.062579e-09 1.00000000
[66,] 8.158411e-01 0.18415894
[67,] 4.348166e-11 1.00000000
[68,] 3.736609e-04 0.99962634
[69,] 9.402788e-01 0.05972122
[70,] 8.836719e-28 1.00000000
> mclust1Dplot(data = precip, what = "classification", parameters = precipMod2$parameters,
+ z = precipMod2$z)
>
> mclust1Dplot(data = precip, what = "density", parameters = precipMod2$parameters,
+ z = precipMod2$z, lwd = 3, col = 2)
> abline(v = precipMod2$parameters$mean, lty = 3, lwd = 4, col = "gray")
>
>
> ### How to plot the density
> pts <- seq(from = min(precip), to = max(precip), length = 1000)
> precipDens <- dens(modelName = precipMod2$modelName, data = pts, parameters = precipMod2$parameters)
>
> plot(pts, precipDens, type = "l")
> library(Ecdat)
Loading required package: Ecfun
Attaching package: 'Ecfun'
The following object is masked from 'package:base':
sign
Attaching package: 'Ecdat'
The following object is masked from 'package:MASS':
SP500
The following object is masked from 'package:car':
Mroz
The following object is masked from 'package:datasets':
Orange
> data(Cigar)
> library(mclust)
> attach(Cigar)
> sales63 = sales[which(year == 63)]
> salesMcl63 = Mclust(sales63, G = 2:8)
> salesMcl63
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
> summary(salesMcl63)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 2 components:
log.likelihood n df BIC ICL
-212.9933 46 4 -441.3011 -441.3193
Clustering table:
1 2
43 3
> salesMcl63$modelName
[1] "E"
> salesMcl63$BIC
Bayesian Information Criterion (BIC):
E V
2 -441.3011 -444.6472
3 -448.9561 -455.9954
4 -456.0961 -463.9096
5 -463.6765 -475.9442
6 -471.2836 -482.9282
7 -478.9133 -498.8937
8 -486.5113 -509.3279
Top 3 models based on the BIC criterion:
E,2 V,2 E,3
-441.3011 -444.6472 -448.9561
> salesMcl63$G
[1] 2
> ###
> mclust1Dplot(data = sales63, what = "density", parameters = salesMcl63$parameters,
+ z = salesMcl63$z, lwd = 3, col = 2)
> abline(v = salesMcl63$parameters$mean, lty = 3, lwd = 4, col = "gray")
>
> salesMcl63$parameters
$pro
[1] 0.93458986 0.06541014
$mean
1 2
120.3423 226.6462
$variance
$variance$modelName
[1] "E"
$variance$d
[1] 1
$variance$G
[1] 2
$variance$sigmasq
[1] 380.6995
$Vinv
NULL
>
> ### The clusters?
> cl63 = map(salesMcl63$z)
>
> ### Repeat the same classification for every year: do we observe
> ### a different behaviour over 'time'?
>
> years = unique(year)
> salesMclust = vector("list", length(years))
> for (j in 1:length(years)) salesMclust[[j]] = Mclust(sales[which(year == years[j])],
+ G = 2)
> salesMclust
[[1]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[2]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[3]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[4]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[5]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[6]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[7]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[8]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[9]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[10]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[11]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[12]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[13]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[14]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[15]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[16]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[17]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[18]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[19]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[20]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[21]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[22]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[23]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[24]]
'Mclust' model object:
best model: univariate, unequal variance (V) with 2 components
[[25]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[26]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[27]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[28]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[29]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
[[30]]
'Mclust' model object:
best model: univariate, equal variance (E) with 2 components
>
> ### number of groups found in each classification
>
> Nbgrp = sapply(1:length(years), function(x) salesMclust[[x]]$G)
> names(Nbgrp) = years
> Nbgrp
63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
88 89 90 91 92
2 2 2 2 2
>
> #### Plot of the means
> means = c()
> for (j in 1:length(years)) means = rbind(means, salesMclust[[j]]$parameters$mean)
>
> rownames(means) = years
>
> plot(as.numeric(rownames(means)), means[, 1], type = "l", col = "red", lwd = 3,
+ axes = F, xlab = "years", ylab = "means", ylim = c(min(means) - 5, max(means) +
+ 5))
> points(as.numeric(rownames(means)), means[, 2], type = "l", col = "blue", lwd = 3)
> axis(1, as.numeric(rownames(means)), rownames(means))
> axis(2)
>
> #### Plot of the proportions
>
> pro = c()
> for (j in 1:length(years)) pro = rbind(pro, salesMclust[[j]]$parameters$pro)
>
> rownames(pro) = years[which(Nbgrp == 2)]
>
> plot(as.numeric(rownames(pro)), pro[, 2], type = "l", col = "red", lwd = 3,
+ axes = F, xlab = "years", ylab = "prop dans la classe 2")
> axis(1, as.numeric(rownames(pro)), rownames(pro))
> axis(2)
>
>
>
> #### Overlap between the clusters
> states = unique(state)
> maps = matrix(1, nrow = length(states), ncol = length(years))
> dim(maps)
[1] 46 30
>
> rownames(maps) = as.character(states)
> colnames(maps) = years
> for (j in 1:length(years)) {
+ if (salesMclust[[j]]$G == 2)
+ maps[, j] = map(salesMclust[[j]]$z)
+ }
>
> grid(states, years, lty = 2, col = "gray")
> image(x = states, y = years, z = maps)
>
> Nbre2 = apply(maps, 1, function(x) sum((x == 2)))
> rownames(USArrests)[states[which(Nbre2 > 0)]]
[1] "Delaware" "Florida" "Iowa" "Louisiana"
[5] "Maryland" "Massachusetts" "New Hampshire" "New Jersey"
[9] "New York" "South Carolina" "Vermont" "Virginia"
[13] "Washington" "West Virginia" NA
>
> ######## plot of the densities
> pts <- seq(from = min(sales), to = max(sales), length = 1000)
> sales.dens = c()
> for (j in 1:length(years)) sales.dens <- rbind(sales.dens, dens(modelName = salesMclust[[j]]$modelName,
+ data = pts, parameters = salesMclust[[j]]$parameters))
>
> matplot(t(sales.dens[18:30, ]), type = "l", lwd = 2, col = 4, lty = 1, axes = F,
+ ylim = c(0, 0.05), ylab = "")
> matplot(t(sales.dens[1:6, ]), type = "l", lwd = 2, col = 2, lty = 1, axes = F,
+ add = T)
> matplot(t(sales.dens[7:17, ]), type = "l", lwd = 2, col = 3, lty = 1, axes = F,
+ add = T)
> legend("topright", legend = c("Avant 69", "70-80", "Après 80"), text.col = c(2,
+ 3, 4))
> axis(1, seq(1, 1000, 100), round(pts[seq(0, 1000, 100)]))
> axis(2)
>
> rownames(sales.dens) = years
> distSales = dist(sales.dens)
> clDenSales = hclust(distSales, "ward")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
> plot(clDenSales, col = 2, lwd = 2, labels = years, hang = -1)
> plot(sort(clDenSales$height, decreasing = T), lwd = 2, col = 2, type = "b")
> ### 5 clusters
> classdensales = cutree(clDenSales, 5)
> names(classdensales) = years
> classdensales
63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 4 3 4 4 4 3 2 2 2 2
88 89 90 91 92
5 5 5 5 5
> ### Colour the densities by cluster
>
> matplot(t(sales.dens[classdensales == 1, ]), type = "l", lwd = 0.6, col = 2,
+ lty = 1, axes = F, ylim = c(0, 0.05), ylab = "")
> matplot(t(sales.dens[classdensales == 2, ]), type = "l", lwd = 0.6, col = 3,
+ lty = 1, axes = F, add = T)
> matplot(t(sales.dens[classdensales == 3, ]), type = "l", lwd = 0.6, col = 4,
+ lty = 1, axes = F, add = T)
> matplot(t(sales.dens[classdensales == 4, ]), type = "l", lwd = 0.6, col = 5,
+ lty = 1, axes = F, add = T)
> matplot(t(sales.dens[classdensales == 5, ]), type = "l", lwd = 0.6, col = 6,
+ lty = 1, axes = F, add = T)
>
> legend("topright", legend = c(1:5), text.col = c(2:6))
> axis(1, seq(1, 1000, 100), round(pts[seq(0, 1000, 100)]))
> axis(2)
>
> #### Average density of each cluster
>
> dens1sales = colMeans(sales.dens[classdensales == 1, ])
> dens2sales = colMeans(sales.dens[classdensales == 2, ])
> dens3sales = colMeans(sales.dens[classdensales == 3, ])
> dens4sales = colMeans(sales.dens[classdensales == 4, ])
> dens5sales = colMeans(sales.dens[classdensales == 5, ])
>
> densclass = cbind(dens1sales, dens2sales, dens3sales, dens4sales, dens5sales)
> matplot(densclass, type = "l", col = c(2, 3, 4, 5, 6), lwd = 2, lty = 1, axes = F)
> legend("topright", legend = c(1, 2, 3, 4, 5), text.col = c(2, 3, 4, 5, 6))
> axis(1, seq(1, 1000, 100), round(pts[seq(0, 1000, 100)]))
> axis(2)
Two types of measures:
Internal validity indices: given a clustering result and the information intrinsic to the data set, they measure the quality of the result: the connectivity index, the silhouette index and the Dunn index.
Stability measures: they assess the consistency of a clustering result by comparing it with the clusterings obtained after removing one column (one variable) at a time: the average proportion of non-overlap, the average distance and the average distance between means.
The connectivity index. For every \(x_i\) and \(j\in \{1,\ldots,n\}\setminus\{i\}\), let \(x_{(j)}\) denote the \(j^{\mbox{th}}\) nearest neighbour of \(x_i\):
\[d(x_i,x_{(1)})\leq d(x_i,x_{(2)})\leq \ldots \leq d(x_i,x_{(j)}) \leq \ldots \leq d(x_i,x_{(n-1)})\]
For a clustering \(\mathcal{C}\),
\[\mbox{Conn}(\mathcal{C})=\displaystyle\sum_{i,j=1,\ldots,n} \delta(x_i,x_{(j)})\]
where \(\delta(x_i,x_{(j)})=0\) if \(x_i\) and its \(j^{\mbox{th}}\) nearest neighbour belong to the same cluster and \(1/j\) otherwise.
\(\mbox{Conn}(\mathcal{C})\) takes its values in \([0,+\infty[\) and is a coefficient that must be minimized:
\(\mbox{Conn}(\mathcal{C})\) close to zero \(\iff\) good connectivity inside the clusters.
The silhouette index is the mean of all the silhouette values.
Silhouette value of an observation \(x_i\):
\[S(i)=\displaystyle\frac{b_i-a_i}{\max(b_i,a_i)}\] where
\(a_i\) is the average distance between \(x_i\) and all the other observations \(x_j\,\in \,C(i)\): \[a_i=\displaystyle\frac{1}{n(C(i))} \displaystyle\sum_{x\in C(i)} d(x_i,x).\]
\(b_i\) is the average distance between \(x_i\) and the observations of the cluster closest to \(x_i\): \[ b_i=\min_{C\in \mathcal{C}\setminus C(i)} \displaystyle\frac{1}{n(C)} \sum_{x\in C} d(x_i,x).\]
The index is defined by
\[S(\mathcal{C})=\displaystyle\frac{1}{n} \sum_{i} S(i)\]
The silhouette width is a number that always lies between \(-1\) and \(+1\).
\(S\) must be maximized.
A negative value of \(S(i)\) indicates that this individual is not in ``the right cluster'' and could be moved to the closest cluster.
The Dunn index:
\[D(\mathcal{C})=\displaystyle\frac{\min_{C,\,C'\in \mathcal{C}, \, C\not=C' }\left(\min_{x\in C,\,x'\in C' } d(x,x')\right)}{\max_{C\in \mathcal{C}} \mbox{diam}(C)}\]
where
\[\mbox{diam}(C)=\max_{x,\, x'\in C } d(x,x')\]
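These indices can also be computed directly for a given partition; a small sketch using connectivity() and dunn() from clValid and silhouette() from cluster (the toy data and object names are illustrative, not from the original notes):

library(clValid)
library(cluster)
Xtoy <- scale(iris[, 1:4])                                  # toy data set
cl   <- cutree(hclust(dist(Xtoy), "ward.D"), k = 3)         # a partition into 3 clusters
connectivity(Data = Xtoy, clusters = cl, neighbSize = 10)   # to be minimized
dunn(Data = Xtoy, clusters = cl)                            # to be maximized
summary(silhouette(cl, dist(Xtoy)))$avg.width               # mean silhouette width, to be maximized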
R packages: clValid, cluster and clusterGeneration.
> install.packages("clValid")
> install.packages("cluster")
> install.packages("clusterGeneration")> library(clValid)
> library(cluster)
> library(clusterGeneration)
simClustDesign generates a data set with a known cluster structure. We generate a data set (here 4 informative variables and 6 noise variables) and vary the sepVal argument, which measures the degree of separability between the clusters.
> dataSim=simClustDesign(numClust=4,
+ sepVal=0.12,
+ sepLabels=c("g1", "g2", "g3","g4"),
+ numNonNoisy=4, ### number of non-noise variables
+ numNoisy=6, ### number of noise variables
+ numOutlier=0, ### number of outliers (as a proportion)
+ numReplicate=1, ### a single data set
+ clustszind=2, ### the number of individuals per cluster is random
+ rangeN=c(30,50), ### range of the cluster sizes
+ )
Generating data sets. Please wait ...
The process is completed successfully!
> X = dataSim$datList
> X = data.frame(dataSim$datList)
> colnames(X) = paste("X", 1:10, sep = "")
> classes = as.vector(unlist(dataSim$memList))
> Xc1 = hclust(dist(X), "ward")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
> Xc1a = cutree(Xc1, 4)
> si = silhouette(Xc1a, dist(X))
> si
cluster neighbor sil_width
[1,] 1 3 0.190103942
[2,] 2 3 0.313201078
[3,] 3 4 -0.162711197
[4,] 1 3 0.111905281
[5,] 4 1 0.205553596
[6,] 1 2 0.068701883
[7,] 3 4 0.062067580
[8,] 4 3 0.217688442
[9,] 2 4 0.122495569
[10,] 4 1 0.042936085
[11,] 3 2 0.101637788
[12,] 3 1 0.079540323
[13,] 1 4 -0.005290733
[14,] 3 4 -0.002561631
[15,] 4 2 0.117713335
[16,] 1 3 -0.023104323
[17,] 3 4 0.018628451
[18,] 1 2 0.036098930
[19,] 1 2 0.060330810
[20,] 3 2 -0.208319764
[21,] 3 4 0.134417135
[22,] 1 3 0.176775344
[23,] 3 4 0.062397114
[24,] 4 3 0.194335430
[25,] 3 4 0.176181920
[26,] 1 3 0.218701635
[27,] 1 4 -0.002039830
[28,] 3 2 0.020015522
[29,] 2 3 0.216860644
[30,] 3 1 0.089890679
[31,] 3 2 -0.010823510
[32,] 3 2 0.048292264
[33,] 3 1 0.046287090
[34,] 3 1 -0.096174165
[35,] 2 3 0.156967948
[36,] 3 4 0.161768541
[37,] 3 1 0.125978191
[38,] 3 4 0.017615515
[39,] 4 2 0.025019279
[40,] 1 3 0.142821362
[41,] 3 4 -0.025388701
[42,] 3 1 -0.065779412
[43,] 4 2 0.178140269
[44,] 3 2 0.030754981
[45,] 2 4 0.166648141
[46,] 3 1 0.140081629
[47,] 3 4 0.026530036
[48,] 4 3 0.288056994
[49,] 2 3 0.295746400
[50,] 4 3 0.081988534
[51,] 1 3 0.124454455
[52,] 1 2 0.030288123
[53,] 3 4 0.027306068
[54,] 4 1 0.219304561
[55,] 3 4 0.156887318
[56,] 3 2 -0.152863516
[57,] 1 2 0.063950602
[58,] 4 1 0.160720245
[59,] 4 1 0.190046053
[60,] 4 1 0.102865040
[61,] 1 3 0.184898881
[62,] 3 4 0.076916848
[63,] 1 4 0.003662436
[64,] 2 4 0.329324021
[65,] 1 3 0.062975574
[66,] 1 4 0.145837387
[67,] 3 1 0.002658018
[68,] 4 3 0.235944657
[69,] 4 3 0.215315634
[70,] 3 2 -0.036827487
[71,] 4 1 0.275862344
[72,] 3 4 0.120185026
[73,] 1 4 0.173936714
[74,] 4 3 0.166682785
[75,] 1 3 0.208556951
[76,] 3 2 -0.143063840
[77,] 4 3 0.104550891
[78,] 1 2 0.163769938
[79,] 1 3 0.075729815
[80,] 2 1 0.249775827
[81,] 1 4 -0.001650284
[82,] 4 3 0.076755609
[83,] 1 4 0.062875094
[84,] 4 3 0.038374330
[85,] 1 3 0.127805104
[86,] 3 1 0.193123503
[87,] 1 2 0.169462325
[88,] 3 4 -0.042580001
[89,] 3 4 0.152272759
[90,] 3 2 -0.135099128
[91,] 3 1 0.116906947
[92,] 4 3 0.118138508
[93,] 1 4 0.060202669
[94,] 1 4 0.144596394
[95,] 3 4 0.124070852
[96,] 4 3 0.242940792
[97,] 3 1 -0.133152171
[98,] 1 3 -0.124155948
[99,] 1 2 0.047709838
[100,] 1 4 0.224365957
[101,] 3 2 -0.118439897
[102,] 1 4 0.095605878
[103,] 3 1 0.134156118
[104,] 4 3 -0.016112297
[105,] 3 4 0.081188905
[106,] 3 1 0.175205703
[107,] 2 3 0.368427155
[108,] 1 4 0.146922887
[109,] 3 4 0.037839629
[110,] 1 4 0.161249574
[111,] 3 2 0.048995807
[112,] 1 4 0.095805889
[113,] 2 1 0.304168244
[114,] 3 2 -0.147622281
[115,] 3 1 0.136807982
[116,] 4 2 0.107388392
[117,] 1 2 0.105432050
[118,] 1 4 -0.008941686
[119,] 2 3 0.253853873
[120,] 3 2 -0.137262821
[121,] 3 2 -0.134423092
[122,] 1 3 0.118044253
[123,] 3 4 0.082149211
[124,] 3 4 -0.128568125
[125,] 2 3 0.357022917
[126,] 3 4 -0.160828133
[127,] 4 1 0.246267819
[128,] 1 3 0.185445438
[129,] 3 2 0.148213200
[130,] 2 4 0.313613469
[131,] 3 1 0.192848670
[132,] 4 3 0.249625650
[133,] 4 3 0.188892401
[134,] 3 4 -0.092698652
[135,] 2 3 0.300210724
[136,] 3 2 -0.010937201
[137,] 4 2 0.187492131
[138,] 3 4 0.153511110
[139,] 4 1 0.173025705
[140,] 2 4 0.316900121
[141,] 3 1 0.099999636
[142,] 1 3 0.140837093
[143,] 2 1 0.299316871
[144,] 3 4 -0.126700278
[145,] 3 4 0.092105011
[146,] 2 1 0.301278974
[147,] 1 3 0.171233021
[148,] 4 3 0.139778342
[149,] 1 3 0.218724325
[150,] 3 1 -0.036241451
[151,] 3 1 -0.103439902
[152,] 3 1 -0.026847506
[153,] 4 1 0.211966893
[154,] 3 1 0.140647263
[155,] 1 4 0.223408442
[156,] 1 2 0.013340370
[157,] 3 4 -0.028768388
[158,] 3 4 -0.033492256
attr(,"Ordered")
[1] FALSE
attr(,"call")
silhouette.default(x = Xc1a, dist = dist(X))
attr(,"class")
[1] "silhouette"
> plot(si, col = rgb(seq(0.2, 0.8, length = 4), 0.1, 0))
> rownames(X) = 1:nrow(X)
> intern = clValid(X, 2:6, clMethods = c("hierarchical", "kmeans", "diana"), method = "ward",
+ validation = "internal")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
> summary(intern)
Clustering Methods:
hierarchical kmeans diana
Cluster sizes:
2 3 4 5 6
Validation Measures:
2 3 4 5 6
hierarchical Connectivity 48.8587 68.9048 90.7278 111.7567 124.9575
Dunn 0.2165 0.1859 0.1859 0.2139 0.2266
Silhouette 0.0902 0.0921 0.0986 0.1034 0.1144
kmeans Connectivity 52.8183 76.5687 70.2087 112.7496 130.6111
Dunn 0.1875 0.1875 0.2312 0.1798 0.2066
Silhouette 0.1111 0.1264 0.1549 0.1435 0.1460
diana Connectivity 65.7976 79.9937 97.8480 120.2476 125.6111
Dunn 0.1811 0.1853 0.2024 0.2099 0.2100
Silhouette 0.0989 0.1136 0.1256 0.1081 0.1354
Optimal Scores:
Score Method Clusters
Connectivity 48.8587 hierarchical 2
Dunn 0.2312 kmeans 4
Silhouette 0.1549 kmeans 4
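The same optimal-score table can also be extracted programmatically; a minimal sketch, assuming the optimalScores() accessor of the clValid package (check your package version):
## best value of each internal measure, with the corresponding method and number of clusters
optimalScores(intern)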
> op <- par(no.readonly = TRUE)
> par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
> plot(intern, legend = FALSE)
> plot(nClusters(intern), measures(intern, "Dunn")[, , 1], type = "n", axes = F,
+ xlab = "", ylab = "")
> legend("center", clusterMethods(intern), col = 1:9, lty = 1:9, pch = paste(1:9))> par(op)Average Proportion of Non-overlap} (APN)
The APN measures the average proportion of individuals that change cluster when the clustering is recomputed after removing one variable from the data set (a small hand-rolled sketch is given after the FOM definition below).
Computation:
Let \(\mathcal{C}^{0}=\{C_1^{0},\ldots,C_K^{0}\}\) be the clustering obtained from the full data set and \(\mathcal{C}^{l}=\{C_1^{l},\ldots,C_K^{l}\}\) the clustering obtained from the data set with the \(l\)-th variable removed.
\(\mbox{APN}(K)\in [0,1]\); values close to zero indicate a strong consistency (stability) of the original clustering \(\mathcal{C}^0\).
Average Distance (AD): \(\mbox{AD}(K)\in [0,+\infty[\); values of \(\mbox{AD}(K)\) close to zero indicate a strong consistency.
Average Distance between Means (ADM): \(\mbox{ADM}(K)\in [0,+\infty[\); values of \(\mbox{ADM}(K)\) close to zero indicate a strong consistency.
Figure of Merit (FOM): for the \(l\)-th variable,
\[\mbox{FOM}(l,K)^2= \displaystyle\frac{1}{n} \sum_{C^l\in \mathcal{C}^l} \sum_{x\in C^l}\mbox{dist}(x,\overline{x}_{C^l})\]
and the overall FOM index is
\[\mbox{FOM}(K)=\sqrt{\displaystyle\frac{n}{n-K}\sum_{l=1}^d\, \mbox{FOM}(l,K)}\]
\(\mbox{FOM}(K)\in [0,+\infty[\); values of \(\mbox{FOM}(K)\) close to zero indicate a strong consistency.
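To make the APN definition concrete, here is a rough hand-rolled version for the hierarchical clustering used above. It is only an illustrative sketch of the leave-one-variable-out idea, not the implementation used by clValid, and the name apn_sketch is ours:
apn_sketch <- function(X, K = 4) {
  ## clustering on the full data set (C^0)
  full <- cutree(hclust(dist(X), "ward.D"), K)
  per_col <- sapply(seq_len(ncol(X)), function(l) {
    ## clustering with the l-th variable removed (C^l)
    part <- cutree(hclust(dist(X[, -l]), "ward.D"), K)
    ## for each individual: proportion of its original co-cluster members
    ## with which it no longer shares a cluster once variable l is removed
    mean(sapply(seq_len(nrow(X)), function(i) {
      same0 <- which(full == full[i])
      1 - mean(part[same0] == part[i])
    }))
  })
  mean(per_col) ## average over all removed variables
}
apn_sketch(X)   ## values close to 0 indicate a stable clustering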
R> rownames(X) = 1:nrow(X)
> stab <- clValid(X, 2:6, clMethods = c("hierarchical", "kmeans", "diana"), validation = "stability")
> summary(stab)
Clustering Methods:
hierarchical kmeans diana
Cluster sizes:
2 3 4 5 6
Validation Measures:
2 3 4 5 6
hierarchical APN 0.0075 0.0174 0.0266 0.0641 0.0953
AD 21.2830 21.0727 20.8003 20.7073 20.6435
ADM 0.2924 0.4416 0.6495 1.3137 1.7736
FOM 4.9361 4.9070 4.9060 4.9047 4.9067
kmeans APN 0.2826 0.4118 0.3470 0.2547 0.3678
AD 20.8614 20.3028 19.2150 18.1484 18.1262
ADM 4.1018 6.4673 5.5338 4.2931 5.8958
FOM 4.9409 4.9057 4.8794 4.9082 4.8546
diana APN 0.2690 0.3388 0.3937 0.4170 0.4005
AD 20.8205 20.3769 19.7321 19.4660 18.7612
ADM 4.2391 5.8878 6.2248 6.5781 6.6181
FOM 4.9344 4.9184 4.9247 4.9090 4.9030
Optimal Scores:
Score Method Clusters
APN 0.0075 hierarchical 2
AD 18.1262 kmeans 6
ADM 0.2924 hierarchical 2
FOM 4.8546 kmeans 6
>
> op <- par(no.readonly = TRUE)
> par(mfrow = c(3, 2), mar = c(4, 4, 3, 1))
> plot(stab, measures = c("ADM", "AD", "APN", "FOM"), legend = FALSE)
> plot(nClusters(stab), measures(stab, "FOM")[, , 1], type = "n", axes = F, xlab = "",
+ ylab = "")
> legend("center", clusterMethods(stab), col = 1:9, lty = 1:9, pch = paste(1:9))
> par(op)
Let \(\mathcal{P}=\{G_1,\ldots,G_r\}\) be a known partition of the set of individuals \(E\).
Let \(\mathcal{C}=\{C_1,\ldots,C_k\}\) be a clustering obtained with a clustering algorithm (hierarchical, k-means, ...).
Goal: compare the clustering \(\mathcal{C}\) with the partition \(\mathcal{P}\).
Consider two individuals \(x_i\) and \(x_{i'}\). Four cases are possible.
Case 1: \(\exists C\in \mathcal{C}\) and \(\exists G\in \mathcal{P}\) such that \[x_i,\, x_{i'}\in C\mbox{ and }x_i,\, x_{i'}\in G\]
Case 2: \(\exists C\in \mathcal{C}\) and \(\exists G\not=G'\in \mathcal{P}\) such that \[x_i,\, x_{i'}\in C\mbox{ and }x_i\in G,\, x_{i'}\in G'\]
Case 3: \(\exists C\not=C'\in \mathcal{C}\) and \(\exists G\in \mathcal{P}\) such that \[x_i\in C,\, x_{i'}\in C'\mbox{ and }x_i,\, x_{i'}\in G\]
Case 4: \(\exists C\not=C'\in \mathcal{C}\) and \(\exists G\not=G'\in \mathcal{P}\) such that \[x_i\in C,\, x_{i'}\in C'\mbox{ and }x_i\in G,\, x_{i'}\in G'\]
Denote by \(\mathcal{A}_l\) the set of pairs \((x_i,x_{i'})\) falling in case \(l\) and set \(a_l=\left|\mathcal{A}_l\right|\). Three indices are used (the counts \(a_1,a_2,a_3,a_4\) correspond to the SS, SD, DS and DD values returned by clv below):
\[\mbox{Rand}=\frac{a_1+a_4}{a_1+a_2+a_3+a_4},\qquad \mbox{Jaccard}=\frac{a_1}{a_1+a_2+a_3},\qquad \mbox{Folkes-Mallows}=\frac{a_1}{\sqrt{(a_1+a_2)(a_1+a_3)}}.\]
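The four counts can also be obtained by hand by comparing the two label vectors pair by pair; a minimal sketch (the name pair_counts is ours; clv::std.ext, used in the session below, does the same job):
pair_counts <- function(labA, labB) {
  ## labA: estimated clustering, labB: known partition
  ## logical masks over all unordered pairs: TRUE when the two individuals share a class
  pairs <- upper.tri(diag(length(labA)))
  sameA <- outer(labA, labA, "==")[pairs]
  sameB <- outer(labB, labB, "==")[pairs]
  c(SS = sum(sameA & sameB),    # case 1
    SD = sum(sameA & !sameB),   # case 2
    DS = sum(!sameA & sameB),   # case 3
    DD = sum(!sameA & !sameB))  # case 4
}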
R> library(clv)
Loading required package: class
Attaching package: 'clv'
The following object is masked from 'package:clValid':
connectivity
> dX = dist(X)
> Xc1 = hclust(dX, "ward")
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
>
> estim.class = cutree(Xc1, 4)
> estim.class
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 2 3 1 4 1 3 4 2 4 3 3 1 3 4 1 3 1
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
1 3 3 1 3 4 3 1 1 3 2 3 3 3 3 3 2 3
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
3 3 4 1 3 3 4 3 2 3 3 4 2 4 1 1 3 4
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
3 3 1 4 4 4 1 3 1 2 1 1 3 4 4 3 4 3
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
1 4 1 3 4 1 1 2 1 4 1 4 1 3 1 3 3 3
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
3 4 1 1 3 4 3 1 1 1 3 1 3 4 3 3 2 1
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
3 1 3 1 2 3 3 4 1 1 2 3 3 1 3 3 2 3
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
4 1 3 2 3 4 4 3 2 3 4 3 4 2 3 1 2 3
145 146 147 148 149 150 151 152 153 154 155 156 157 158
3 2 1 4 1 3 3 3 4 3 1 1 3 3
> cc = std.ext(estim.class, classes)
> cc
$SS
[1] 2017
$SD
[1] 1675
$DS
[1] 1102
$DD
[1] 7609
> rand1 = clv.Rand(cc)
> rand1
[1] 0.7761026
>
> jacc1 = clv.Jaccard(cc)
> jacc1
[1] 0.4207343
>
> fm1 = clv.Folkes.Mallows(cc)
> fm1
[1] 0.5943843
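As a quick check of the formulas above, the three indices can be recomputed directly from the counts returned by std.ext, using the cc object of the session:
a1 <- cc$SS; a2 <- cc$SD; a3 <- cc$DS; a4 <- cc$DD
(a1 + a4) / (a1 + a2 + a3 + a4)   ## Rand, about 0.776
a1 / (a1 + a2 + a3)               ## Jaccard, about 0.421
a1 / sqrt((a1 + a2) * (a1 + a3))  ## Folkes-Mallows, about 0.594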