::include_graphics("img/clustering.png") knitr
Multivariate Analysis Lecture 16/17: Cluster Analysis
Outline
Introduction
Model-free cluster analysis
K-Means
Model-based cluster analysis
Gaussian mixture model
Choose the number of clusters
Introduction
Introduction to Cluster Analysis
Exploratory
Aims to understand/learn the unknown substructure of multivariate data
Seeks rules to group data
- Large between-group difference
- Small within-group difference
Cluster Analysis vs Classification Analysis
- Cluster analysis
- Data are unlabeled
- The number of clusters is unknown
- “Unsupervised” learning
- Classification
- The labels for training data are known
- The number of classes is known
- Want to allocate new observations, whose labels are unknown, to one of the known classes
- “Supervised” learning
Clustering Methods
- Model-free:
- Hierarchical clustering: based on similarity measures
- Nonhierarchical clustering: e.g., K-means
- Model-based clustering
K-Means
Introduction
- The idea goes back to Hugo Steinhaus (1956); the name was first used by James MacQueen (1967)
- K-means is a model-free clustering method
- It is a type of nonhierarchical clustering
- Assign each observation to the cluster with the nearest centroid
- “Nearest” is usually defined in terms of Euclidean distance
- The goal is to minimize within-cluster variances
- We need to decide whether or not to standardize observations
- There are variations such as more robust versions (k-medians, k-medoids) and model-based (Gaussian mixture)
K-Means vs K-Nearest-Neighbors (KNN)
- Both are non-parametric
- K-means: unsupervised
- KNN: supervised
- K:
- the number of neighbors in KNN
- the number of clusters in K-Means
How Does K-means Work
- We assume the number of classes, denoted by \(K\), is known
- Let \(\mathbf X_1, \cdots, \mathbf X_n\) denote \(n\) observations in \(\mathbb R^p\).
- Let \(C_1, \cdots, C_K\) denote the sets of indices of the observations assigned to each cluster
- Want to find \((C_1, \cdots, C_K)\) to minimize the objective function
\[\underset{C_1, \cdots, C_K}{\min} \left\{ \sum_{k=1}^K \frac{1}{|C_k|}\sum_{(i,i')\in C_k} D^2(\mathbf X_i, \mathbf X_{i'}) \right\}\]
How Does K-means Work
The objective function is equivalent (up to a constant factor) to \[\underset{Z_1, \cdots, Z_n}{\min} \left\{ \sum_{i=1}^n D^2(\mathbf X_i, \bar {\mathbf X}_{Z_i}) \right\}\] where \(Z_i\) is the cluster label of observation \(i\) and \(\bar {\mathbf X}_{Z_i}\) is the centroid of the cluster to which observation \(i\) is assigned.
One widely used choice of \(D\) is the Euclidean distance, i.e., \[D^2(x,y)=||x-y||^2=(x-y)^T (x-y)\]
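As a quick illustration (an added sketch, not part of the original slides), the lines below evaluate this centroid-based objective in R for an arbitrary assignment of the iris observations; the variable names (membership, centroids, obj) are made up for this example.

# Total within-cluster sum of squared Euclidean distances to the centroids
X <- as.matrix(iris[, -5])
membership <- rep(1:3, length.out = nrow(X))                        # an arbitrary assignment
centroids <- apply(X, 2, function(v) tapply(v, membership, mean))   # 3-by-4 matrix of cluster means
obj <- sum((X - centroids[membership, ])^2)                         # value of the objective
obj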
K-Means: Algorithm
- Step 0: Standardize the data if appropriate
- Step 1: Partition the observations into K initial clusters. Alternatively, we can initialize K centroids
- Step 2: Assign each observation to the cluster with the nearest centroid
- Step 3: Recalculate the centroids
- Step 4: Repeat Steps 2-3 until the assignments no longer change
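A minimal sketch of Steps 1-4 in plain R (illustrative only, not the implementation behind the figures below; the function and variable names are made up, and empty clusters are not handled):

my_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: initialize the centroids with K randomly chosen observations
  centroids <- X[sample(nrow(X), K), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # Step 2: assign each observation to the nearest centroid (squared Euclidean distance)
    d <- sapply(1:K, function(k) colSums((t(X) - centroids[k, ])^2))
    new_assignment <- apply(d, 1, which.min)
    # Step 4: stop when the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recalculate each centroid as the mean of its cluster
    for (k in 1:K)
      centroids[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}
# e.g., my_kmeans(iris[, -5], K = 3)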
Figures of K-means of Iris
Cluster analysis using the R function “kmeans”:
obj = kmeans(data.frame(iris[,-5]), centers = 3)
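The fitted object can then be inspected, for example (the component names are those listed in the kmeans output shown later):

obj$centers                       # cluster centroids
obj$tot.withinss                  # total within-cluster sum of squares
table(obj$cluster, iris$Species)  # compare the clusters with the known species labels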
An Animated Figure of Clusters of Iris Data
Code
##################################
## Define Squared Euclidean Distance
distxy = function(x, y) { return(sum((x - y)^2)) }

# prepare the scatter plots for an animated gif
dir_out = "C:/Users/yu790/Desktop/Desktop/teaching-multivariate/img/Iris_kmean/scatter"
dir.create(dir_out, recursive = TRUE, showWarnings = FALSE)

mydata = iris[, -5]
# mydata = scale(mydata)
n = dim(mydata)[1]
K = 3
distance = matrix(0, n, K)

# initialize
set.seed(1)
cluster01 = rmultinom(n, size = 1, prob = c(0.3, 0.3, 0.4))
cluster01.new = rmultinom(n, size = 1, prob = c(0.3, 0.3, 0.4))
cluster123 = apply(cluster01.new, 2, which.max)

i = 0
while (sum(cluster01 != cluster01.new))
{
  i = i + 1
  if (i < 10) {name = paste('000', i, 'plot.png', sep = '')}
  if (i < 100 && i >= 10) {name = paste('00', i, 'plot.png', sep = '')}
  if (i >= 100) {name = paste('0', i, 'plot.png', sep = '')}
  png(filename = paste(dir_out, "/", name, sep = ""))
  plot(mydata[, 3], mydata[, 4], type = "n", xlab = "Petal L", ylab = "Petal W",
       main = paste("K-mean Clustering (Iris Data): iteration=", i))
  for (k in 1:K)
    points(mydata[cluster123 == k, 3], mydata[cluster123 == k, 4], col = k)
  dev.off()

  centroids = as.matrix(cluster01.new) %*% as.matrix(mydata) / rowSums(cluster01.new)
  ## to avoid label switching
  centroids = centroids[order(centroids[, 1]), ]
  cluster01 = cluster01.new

  for (k in 1:K)
    distance[, k] = apply(mydata, 1, distxy, centroids[k, ])
  cluster123 = apply(distance, 1, which.min)
  # convert the assignment to dummy variables
  cluster01.new = matrix(0, K, n)
  for (k in 1:K)
    cluster01.new[k, cluster123 == k] = 1

  print(i)
  print(table(cluster123))
}
[1] 1
cluster123
1 2 3
22 49 79
[1] 2
cluster123
1 2 3
50 30 70
[1] 3
cluster123
1 2 3
50 36 64
[1] 4
cluster123
1 2 3
50 40 60
[1] 5
cluster123
1 2 3
50 45 55
[1] 6
cluster123
1 2 3
50 49 51
[1] 7
cluster123
1 2 3
50 54 46
[1] 8
cluster123
1 2 3
50 57 43
[1] 9
cluster123
1 2 3
50 60 40
[1] 10
cluster123
1 2 3
50 61 39
[1] 11
cluster123
1 2 3
50 61 39
Code
## list file names and read in
library(magick)   # provides image_read, image_join, image_animate, image_write
imgs <- list.files(dir_out, full.names = TRUE)
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 2 frames per second
img_animated <- image_animate(img_joined, fps = 2)

## view animated image
img_animated
Code
## save to disk
image_write(image = img_animated,
path = "img/iris_kmeans_animated_scatter.gif")
Cluster Memberships of Iris Data (K-Means)
::include_graphics("img/Iris_kmeans_cluster_memberships_steps1to9.png") knitr
An Animated Figure of Cluster Memberships of Iris K-Means
Code
## Define Squared Euclidean Distance
distxy = function(x, y) { return(sum((x - y)^2)) }

# prepare the plots of cluster assignments for an animated gif
dir_out = "C:/Users/yu790/Desktop/Desktop/teaching-multivariate/img/Iris_kmean/assignment"
dir.create(dir_out, recursive = TRUE, showWarnings = FALSE)

mydata = iris[, -5]
# mydata = scale(mydata)  # if we would like to use standardized data
n = dim(mydata)[1]
K = 3
distance = matrix(0, n, K)

# initialize clusters
set.seed(1)
cluster01 = rmultinom(n, size = 1, prob = c(0.3, 0.3, 0.4))
cluster01.new = rmultinom(n, size = 1, prob = c(0.3, 0.3, 0.4))
cluster123 = apply(cluster01.new, 2, which.max)
# cluster01.new is 3-by-150

i = 0
while (sum(cluster01 != cluster01.new))
{
  i = i + 1
  if (i < 10) {name = paste('000', i, 'plot.png', sep = '')}
  if (i < 100 && i >= 10) {name = paste('00', i, 'plot.png', sep = '')}
  if (i >= 100) {name = paste('0', i, 'plot.png', sep = '')}
  png(filename = paste(dir_out, "/", name, sep = ""))
  tmp = cluster01.new
  colnames(tmp) = 1:n
  barplot(tmp, ylim = c(0, 1), col = c(1:3),
          main = paste("K-mean Clustering (Iris Data): iteration=", i),
          xlab = "Iris Species: Setosa, Versicolor, Virginica")
  dev.off()

  centroids = as.matrix(cluster01.new) %*% as.matrix(mydata) / rowSums(cluster01.new)
  ## to avoid label switching
  centroids = centroids[order(centroids[, 1]), ]
  cluster01 = cluster01.new

  for (k in 1:K)
    distance[, k] = apply(mydata, 1, distxy, centroids[k, ])
  cluster123 = apply(distance, 1, which.min)  # distance is 150-by-3
  # convert the assignment to dummy variables
  cluster01.new = matrix(0, K, n)
  for (k in 1:K)
    cluster01.new[k, cluster123 == k] = 1
  print(i)
  print(table(cluster123))
}
[1] 1
cluster123
1 2 3
22 49 79
[1] 2
cluster123
1 2 3
50 30 70
[1] 3
cluster123
1 2 3
50 36 64
[1] 4
cluster123
1 2 3
50 40 60
[1] 5
cluster123
1 2 3
50 45 55
[1] 6
cluster123
1 2 3
50 49 51
[1] 7
cluster123
1 2 3
50 54 46
[1] 8
cluster123
1 2 3
50 57 43
[1] 9
cluster123
1 2 3
50 60 40
[1] 10
cluster123
1 2 3
50 61 39
[1] 11
cluster123
1 2 3
50 61 39
Code
## list file names and read in
imgs <- list.files(dir_out, full.names = TRUE)
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 2 frames per second
img_animated <- image_animate(img_joined, fps = 2)

## view animated image
img_animated
Code
## save to disk
image_write(image = img_animated,
path = "img/iris_kmeans_animated_assignment.gif")
Remarks
The algorithm is guaranteed not to increase the objective function at any iteration
It might converge to a local minimum
Remedy: use several different random initial values (e.g., via the nstart argument of kmeans, as shown below)
Different Initial Values
::include_graphics("img/Iris_kmeans_InitialValues.png") knitr
Different Initial Values
kmeans(data.frame(iris[,-5]),centers=3, nstart=5)
K-means clustering with 3 clusters of sizes 62, 38, 50
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.901613 2.748387 4.393548 1.433871
2 6.850000 3.073684 5.742105 2.071053
3 5.006000 3.428000 1.462000 0.246000
Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
[112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
[149] 2 1
Within cluster sum of squares by cluster:
[1] 39.82097 23.87947 15.15100
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
Model-based
Mixture Model
- Consider a random variable \(X\)
- We say it follows a mixture of \(K\) distributions if its pdf/pmf can be written as a weighted sum of \(K\) component pdfs/pmfs: \[f_{Mix}(x)=\sum_{k=1}^K p_k f_k(x)\]
- The weights \(p_k, k=1, \cdots, K\) are nonnegative numbers and they add up to 1
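For intuition (an added sketch, not part of the original slides), data from a mixture can be simulated by first drawing the component label with probabilities \(p_k\) and then drawing from the selected component; the parameter values below are made up for illustration.

set.seed(1)
p <- c(0.3, 0.7)                                 # mixing weights (sum to 1)
mu <- c(0, 3); sigma <- c(1, 1)                  # component means and sds
z <- sample(1:2, 500, replace = TRUE, prob = p)  # unobserved component labels
x <- rnorm(500, mean = mu[z], sd = sigma[z])     # observed data
hist(x, breaks = 30, main = "Sample from a two-component normal mixture")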
Example: Gaussian Mixture Models
- Let \(f_k(x)\) be the pdf of \(N(\boldsymbol \mu_k, \boldsymbol\Sigma_k)\)
- We say \(X\) follows a Gaussian mixture model if its pdf is
\[\begin{aligned} f_{Mix}(x|\boldsymbol \mu_1, \boldsymbol \Sigma_1, \cdots, \boldsymbol \mu_K, \boldsymbol \Sigma_K, p_1, \cdots, p_K)\\ =\sum_{k=1}^K p_k \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2} (x-\boldsymbol\mu_k)^T \boldsymbol\Sigma_k^{-1} (x-\boldsymbol \mu_k)\right) \end{aligned}\]
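As a small sketch (assuming the mvtnorm package is available; the parameter values and the point x are made up), the mixture density above can be evaluated directly:

library(mvtnorm)                                   # for dmvnorm()
p <- c(0.4, 0.6)                                   # mixing weights
mu1 <- c(0, 0); Sigma1 <- diag(2)
mu2 <- c(3, 3); Sigma2 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
x <- c(1, 1)
p[1] * dmvnorm(x, mean = mu1, sigma = Sigma1) +
  p[2] * dmvnorm(x, mean = mu2, sigma = Sigma2)    # value of f_Mix at x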
Examples: Mixtures of Two Univariate Normals
Code
# Load required libraries
library(ggplot2)
library(gridExtra)

# Set the parameters
mean1 <- 1
var1  <- 0.25
mean0 <- 0
var0  <- 0.25

# Generate data points
x  <- seq(-2, 3, length.out = 1000)
y1 <- dnorm(x, mean = mean1, sd = sqrt(var1))
y0 <- dnorm(x, mean = mean0, sd = sqrt(var0))
ymix_70 <- 0.7 * y1 + 0.3 * y0
ymix_eq <- 0.5 * y1 + 0.5 * y0
ymix_20 <- 0.2 * y1 + 0.8 * y0

# Create a data frame for plotting
df <- data.frame(x = x, y1 = y1, y0 = y0, ymix_70 = ymix_70,
                 ymix_eq = ymix_eq, ymix_20 = ymix_20)

# Create the plots
plot1 <- ggplot(df, aes(x)) +
  geom_line(aes(y = y1, color = "Distribution 1"), size = 1) +
  geom_line(aes(y = y0, color = "Distribution 0"), size = 1) +
  geom_line(aes(y = ymix_eq, color = "Mixture"), size = 1) +
  scale_color_manual(values = c("Distribution 1" = "blue", "Distribution 0" = "green", "Mixture" = "red")) +
  labs(x = "x", y = "Density", title = "Mixture of Two Normals: 0.5 and 0.5") +
  theme_minimal()
plot2 <- ggplot(df, aes(x)) +
  geom_line(aes(y = y1, color = "Distribution 1"), size = 1) +
  geom_line(aes(y = y0, color = "Distribution 0"), size = 1) +
  geom_line(aes(y = ymix_70, color = "Mixture"), size = 1) +
  scale_color_manual(values = c("Distribution 1" = "blue", "Distribution 0" = "green", "Mixture" = "red")) +
  labs(x = "x", y = "Density", title = "Mixture of Two Normals: 0.7 and 0.3") +
  theme_minimal()
plot3 <- ggplot(df, aes(x)) +
  geom_line(aes(y = y1, color = "Distribution 1"), size = 1) +
  geom_line(aes(y = y0, color = "Distribution 0"), size = 1) +
  geom_line(aes(y = ymix_20, color = "Mixture"), size = 1) +
  scale_color_manual(values = c("Distribution 1" = "blue", "Distribution 0" = "green", "Mixture" = "red")) +
  labs(x = "x", y = "Density", title = "Mixture of Two Normals: 0.2 and 0.8") +
  theme_minimal()

# Display the plots
grid.arrange(plot1, plot2, plot3, ncol = 1, nrow = 3)
Gaussian Mixture Model
The likelihood function based on \(n\) observations is \[L(\boldsymbol \theta)=\prod_{i=1}^n f_{Mix}(x_i|\boldsymbol\theta)\] where \(\boldsymbol\theta\) represents all the parameters.
There is no closed-form solution for the maximum likelihood estimate (MLE)
The expectation-maximization (EM) algorithm can be used to estimate parameters
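Since different runs of the algorithm are compared through the observed-data likelihood (as noted later under initialization), a small helper for the univariate two-component case can be sketched as follows; the function name and parameterization are illustrative, with p denoting the weight of the second component as in the EM code further below.

# Observed-data log-likelihood of a univariate two-component Gaussian mixture
loglik_mix <- function(x, p, mu, sd) {
  sum(log((1 - p) * dnorm(x, mu[1], sd[1]) + p * dnorm(x, mu[2], sd[2])))
}
# e.g., loglik_mix(x, p = 0.5, mu = c(1, 3), sd = c(1, 1)) for an observed vector x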
The Expectation Maximization (EM) Algorithm
- Formally outlined by Dempster, Laird, and Rubin (1977) in “Maximum likelihood from incomplete data via the EM algorithm”
- Observed data: \(y^O\)
- Unobserved data: \(y^U\)
- Complete data: \(y^C=(y^O, y^U)\)
- Parameter \(\boldsymbol\theta\)
The EM Algorithm
Initialize parameters: set \(\boldsymbol\theta=\boldsymbol\theta^{(0)}\)
For \(t=0, 1, 2, \ldots\) until convergence:
- E-step: Compute
\[Q(\boldsymbol\theta| \boldsymbol\theta^{(t)}) = E[\log f(y^C|\boldsymbol\theta)\mid y^O, \boldsymbol\theta^{(t)}]\]
In most situations this step amounts to computing conditional expectations
- M-step: Update the parameters
\[\boldsymbol\theta^{(t+1)}=\underset{\boldsymbol\theta}{\arg\max}\, Q(\boldsymbol\theta| \boldsymbol\theta^{(t)})\]
The EM Algorithm
- Ascent property
- The M step ensures the algorithm improves \(Q(\boldsymbol\theta| \boldsymbol\theta^{(t)})\)
- It can be shown that improving \(Q(\boldsymbol\theta| \boldsymbol\theta^{(t)})\) implies improving \(L(\boldsymbol\theta)=f(y^O|\boldsymbol\theta)\) (the observed-data likelihood)
- The algorithm may converge only to a local maximum
- Choose multiple sets of initial values
Example of EM Algorithm
- Observed data
- x=(0.37 1.18 0.16 2.60 1.33 0.18 1.49 1.74 3.58 2.69 4.51 3.39 2.38 0.79 4.12 2.96 2.98 3.94 3.82 3.59)
- Assume \(K=2\)
- Parameters: \(\mu_1, \mu_0, \sigma_1, \sigma_0, p\)
- Unobserved data: \(z=(\cdots)\), the class membership, a vector of binary variables
Initialization
- Parameters need to be initialized
- Different initial values might lead to different MLEs
- Use different initial values and check which one gives the largest likelihood
E-Step
::include_graphics("img/EM1.png") knitr
E-Step
::include_graphics("img/EM2.png") knitr
M-Step
::include_graphics("img/EM3.png") knitr
Code the EM in R
Code
##### gif animation for simulated data. Gaussian Mixture model, EM algorithm
="C:/Users/yu790/Desktop/Desktop/teaching-multivariate/img/EM"
dir_outdir.create(dir_out, recursive = TRUE, showWarnings = FALSE)
set.seed(1);
=c(rnorm(8,1), rnorm(12,3))
x=length(x)
n=rbinom(n,1,0.5)
z=c(1:n)[!z]
k0=setdiff(1:n, k0)
k1=mean(z)
p.old=c(mean(x[k0]), mean(x[k1]) )
u.old=c(sd(x[k0]), sd(x[k1]) )
sd.old=p.old
p.new=u.old
u.new=sd.old
sd.new
=0
iwhile(i==0 || max(abs(u.new-u.old))>0.0001)
{=p.new
p.old=u.new
u.old=sd.new
sd.old
##### E-Step
=p.old*dnorm(x, u.old[2], sd.old[2]) /
ez11-p.new)*dnorm(x, u.old[1], sd.old[1]) + p.old*dnorm(x, u.old[2], sd.old[2]))
((=1-ez1
ez0
##### M-step
=mean(ez1)
p.new=c( weighted.mean(x, ez0), weighted.mean(x, ez1))
u.new=c( sqrt(weighted.mean((x-u.new[1])^2, ez0)) ,
sd.newsqrt(weighted.mean((x-u.new[2])^2, ez1)))
=i+1
iif(i==1) cat("\t", "p \t", "u0 \t", "u1 \n")
cat("iteration", i, "\n")
print(c(p.new, u.new))
if (i < 10) {name = paste('000',i,'plot.png',sep='')}
if (i < 100 && i >= 10) {name = paste('00',i,'plot.png', sep='')}
if (i >= 100) {name = paste('0', i,'plot.png', sep='')}
png(filename=paste(dir_out,"/", name, sep=""))
=rbind(ez1,1-ez1)
tmpcolnames(tmp)=1:n
barplot(tmp, ylim=c(0,1),col=c(2,4),main=paste("Estimated Prob from EM: iteration=", i),ylab="prob")
dev.off()
}
p u0 u1
iteration 1
[1] 0.5486808 2.7438535 2.0998914
iteration 2
[1] 0.5499383 2.7797870 2.0719564
iteration 3
[1] 0.5510853 2.8159372 2.0439817
iteration 4
[1] 0.5523403 2.8560647 2.0132132
iteration 5
[1] 0.5538187 2.9025178 1.9780384
iteration 6
[1] 0.5555898 2.9574019 1.9370844
iteration 7
[1] 0.5576711 3.0228999 1.8889413
iteration 8
[1] 0.5599612 3.1009544 1.8322405
iteration 9
[1] 0.5620851 3.1919255 1.7661598
iteration 10
[1] 0.5631783 3.2917851 1.6914726
iteration 11
[1] 0.5618823 3.3892693 1.6117699
iteration 12
[1] 0.5571285 3.4680534 1.5339762
iteration 13
[1] 0.5496536 3.5180095 1.4667436
iteration 14
[1] 0.5410559 3.5420960 1.4137166
iteration 15
[1] 0.531741 3.548049 1.371190
iteration 16
[1] 0.5218244 3.5432627 1.3342075
iteration 17
[1] 0.5114847 3.5329410 1.2994096
iteration 18
[1] 0.5008803 3.5199750 1.2650429
iteration 19
[1] 0.4901394 3.5057700 1.2304049
iteration 20
[1] 0.4793735 3.4909783 1.1953685
iteration 21
[1] 0.4686951 3.4759362 1.1601186
iteration 22
[1] 0.458235 3.460886 1.125049
iteration 23
[1] 0.4481546 3.4460795 1.0907414
iteration 24
[1] 0.4386503 3.4318119 1.0579664
iteration 25
[1] 0.4299456 3.4184276 1.0276513
iteration 26
[1] 0.422263 3.406285 1.000768
iteration 27
[1] 0.4157754 3.3956927 0.9781162
iteration 28
[1] 0.4105534 3.3868395 0.9600769
iteration 29
[1] 0.4065404 3.3797458 0.9464773
iteration 30
[1] 0.4035762 3.3742769 0.9366876
iteration 31
[1] 0.4014519 3.3701973 0.9298717
iteration 32
[1] 0.3999615 3.3672344 0.9252231
iteration 33
[1] 0.3989304 3.3651268 0.9220866
iteration 34
[1] 0.3982234 3.3636510 0.9199797
iteration 35
[1] 0.3977415 3.3626292 0.9185661
iteration 36
[1] 0.3974142 3.3619276 0.9176173
iteration 37
[1] 0.3971925 3.3614485 0.9169798
iteration 38
[1] 0.3970425 3.3611226 0.9165512
iteration 39
[1] 0.3969411 3.3609016 0.9162627
iteration 40
[1] 0.3968727 3.3607519 0.9160685
iteration 41
[1] 0.3968265 3.3606507 0.9159376
iteration 42
[1] 0.3967953 3.3605824 0.9158494
Code
## list file names and read in
imgs <- list.files(dir_out, full.names = TRUE)
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 2 frames per second
img_animated <- image_animate(img_joined, fps = 2)

## view animated image
img_animated
Code
## save to disk
image_write(image = img_animated,
path = "img/EM.gif")
Gaussian Mixture For Iris Data
Code
##### gif animation for iris data. Gaussian Mixture model, EM algorithm
### two species: Setosa, Versicolor. One variable: petal length
library(MASS)
="C:/Users/yu790/Desktop/Desktop/teaching-multivariate/img/Iris_GMixture"
dir_outdir.create(dir_out, recursive = TRUE, showWarnings = FALSE)
set.seed(1);
=c(iris3[,3,1],iris3[,3,2])
x=length(x)
n=rbinom(n,1,0.5)
z=c(1:n)[!z]
k0=setdiff(1:n, k0)
k1=mean(z)
p.old=c(mean(x[k0]), mean(x[k1]) )
u.old=c(sd(x[k0]), sd(x[k1]) )
sd.old=p.old
p.new=u.old
u.new=sd.old
sd.new
=0
iwhile(i==0 || max(abs(u.new-u.old))>0.0001)
{=p.new
p.old=u.new
u.old=sd.new
sd.old
#e-step
=p.old*dnorm(x, u.old[2], sd.old[2]) /
ez11-p.new)*dnorm(x, u.old[1], sd.old[1]) + p.old*dnorm(x, u.old[2], sd.old[2]))
((=1-ez1
ez0
#m-step
=mean(ez1)
p.new=c( weighted.mean(x, ez0), weighted.mean(x, ez1))
u.new=c( sqrt(weighted.mean((x-u.new[1])^2, ez0)) ,
sd.newsqrt(weighted.mean((x-u.new[2])^2, ez1)))
=i+1
iif(i==1) cat("\t", "p \t", "u0 \t", "u1 \n")
cat("iteration", i, "\n")
print(c(p.new, u.new))
if (i < 10) {name = paste('000',i,'plot.png',sep='')}
if (i < 100 && i >= 10) {name = paste('00',i,'plot.png', sep='')}
if (i >= 100) {name = paste('0', i,'plot.png', sep='')}
png(filename=paste(dir_out,"/", name, sep=""))
=rbind(ez1,1-ez1)
tmpcolnames(tmp)=1:n
barplot(tmp, ylim=c(0,1),col=c(2,4),main=paste("Estimated Prob (Iris Data): iteration=", i),ylab="prob")
dev.off()
}
p u0 u1
iteration 1
[1] 0.4787883 3.1563362 2.5394953
iteration 2
[1] 0.4782697 3.1680802 2.5260153
iteration 3
[1] 0.4778071 3.1804369 2.5118890
iteration 4
[1] 0.4772961 3.1940672 2.4962462
iteration 5
[1] 0.4767149 3.2092686 2.4787092
iteration 6
[1] 0.4760469 3.2263743 2.4588568
iteration 7
[1] 0.4752706 3.2458171 2.4361371
iteration 8
[1] 0.4743576 3.2681770 2.4098015
iteration 9
[1] 0.4732687 3.2942509 2.3788071
iteration 10
[1] 0.4719481 3.3251648 2.3416565
iteration 11
[1] 0.4703152 3.3625623 2.2961238
iteration 12
[1] 0.4682516 3.4089313 2.2387669
iteration 13
[1] 0.465584 3.468194 2.164038
iteration 14
[1] 0.4620888 3.5467770 2.0626964
iteration 15
[1] 0.4576889 3.6553706 1.9197582
iteration 16
[1] 0.4540255 3.8101438 1.7196361
iteration 17
[1] 0.462824 4.022454 1.512960
iteration 18
[1] 0.4940374 4.2282204 1.4607771
iteration 19
[1] 0.4999789 4.2598997 1.4619823
iteration 20
[1] 0.4999995 4.2599976 1.4619996
Code
## list file names and read in
imgs <- list.files(dir_out, full.names = TRUE)
img_list <- lapply(imgs, image_read)

## join the images together
img_joined <- image_join(img_list)

## animate at 2 frames per second
img_animated <- image_animate(img_joined, fps = 2)

## view animated image
img_animated
Code
## save to disk
image_write(image = img_animated,
path = "img/iris_GMixture.gif")
Choose the Optimal Number of Clusters
- Model-based method: introduce a penalty term for \(K\)
- AIC: maximize
\[AIC=2\ln L_{max}-2\left(K\frac{1}{2}(p+1)(p+2)-1 \right)\]
- BIC: maximize \[BIC=2\ln L_{max}-2\ln n\left(K\frac{1}{2}(p+1)(p+2)-1 \right)\]
- Here \(K\frac{1}{2}(p+1)(p+2)-1\) is the number of free parameters in a \(K\)-component Gaussian mixture in \(p\) dimensions (\(K-1\) weights, \(Kp\) means, and \(Kp(p+1)/2\) covariance entries)
- Model-free:
- Elbow method (a small sketch follows this list)
- Silhouette analysis (Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Journal of Computational and Applied Mathematics. 20: 53–65. Available in the “cluster” package)
- Gap statistic method (Tibshirani, R., Walther, G. and Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. JRSSB, 63: 411-423. Also available in the “cluster” package)
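A small sketch of the elbow method for k-means (silhouette analysis and the gap statistic are available through the silhouette() and clusGap() functions of the “cluster” package mentioned above):

# Elbow method: plot the total within-cluster sum of squares against K
# and look for the "elbow" where the decrease levels off
wss <- sapply(1:8, function(k) kmeans(iris[, -5], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")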
Compare Cluster Results
Rand index
Adjusted Rand index
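As a small added sketch, the Rand index (the proportion of observation pairs on which two partitions agree, i.e., both in the same cluster or both in different clusters) can be computed directly from two label vectors; the adjusted Rand index corrects for chance agreement and is available, e.g., as adjustedRandIndex() in the mclust package.

rand_index <- function(a, b) {
  pairs  <- combn(length(a), 2)                # all pairs of observations
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]     # together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]     # together in partition b?
  mean(same_a == same_b)                       # proportion of agreeing pairs
}
# e.g., rand_index(kmeans(iris[, -5], 3, nstart = 5)$cluster, as.integer(iris$Species))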