This research aims to evaluate the effectiveness and consistency of
three clustering evaluation metrics by applying them to three images
with distinct visual and structural properties:
- a part of the Diego Rivera mural “Dream of a Sunday Afternoon in the Alameda Central” (a highly complex, colorful image);
- a photo of pierogi (the author's own photo, with distinct object boundaries);
- a view of the Baltic Sea (a natural image dominated by gradients and subtle variations).
By analyzing the results from these metrics and visually inspecting
the segmented images, this study seeks to identify patterns in how the
metrics perform across different types of images and explore the
conditions under which one method may be more suitable than
others.
For this study, the clustering algorithms K-means and CLARA
(Clustering Large Applications) were used due to their effectiveness and
scalability in handling image data. K-means is computationally efficient
and well-suited for image clustering, where pixels can be represented as
points in RGB space. It provides compact clusters, which makes it ideal
for segmenting images into distinct regions. CLARA is a variation of
k-medoids, optimized for large datasets. Instead of processing the
entire dataset, it uses random sampling to find cluster medoids, making
it faster and more memory-efficient. This study uses a combination of
K-means and CLARA to keep the image analysis computationally efficient.
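The difference between the two algorithms can be illustrated on a small synthetic example. This is a minimal sketch, not the study's pipeline: the `pixels` matrix is made-up data standing in for image pixels, and only `kmeans()` from base R and `clara()` from the `cluster` package are used.

```r
library(cluster)  # provides clara()

set.seed(42)
# Hypothetical stand-in for image pixels: two colour groups in RGB space
pixels <- rbind(
  matrix(runif(300, 0.0, 0.3), ncol = 3),  # 100 dark pixels
  matrix(runif(300, 0.7, 1.0), ncol = 3)   # 100 light pixels
)
colnames(pixels) <- c("r.value", "g.value", "b.value")

km <- kmeans(pixels, centers = 2, nstart = 5)  # centres are cluster means
cl <- clara(pixels, k = 2)                     # medoids are actual data rows

km$centers   # averaged colours, which need not occur in the data
cl$medoids   # colours of real pixels chosen as cluster representatives
```

K-means summarizes each cluster by an averaged colour, whereas CLARA picks one existing pixel per cluster, which is why the recoloured images later in this document use medoid colours.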
To determine the optimal number of clusters (k), three widely used
metrics were employed: the Elbow Method, Silhouette Score, and
Calinski-Harabasz Index. Each metric evaluates clustering quality using
different criteria.
- Silhouette Score evaluates the quality of clustering by measuring how
similar a data point is to its own cluster (cohesion) compared to other
clusters (separation). A higher Silhouette Score indicates
better-defined clusters, and the number of clusters with the highest
score is considered optimal.
- Elbow Method involves plotting the Within-Cluster Sum of Squares
(WCSS), also known as the Sum of Squared Errors (SSE), against the
number of clusters. The optimal number of clusters corresponds to the
“elbow point,” where the SSE starts to decrease more gradually,
indicating diminishing returns in clustering quality with additional
clusters.
- Calinski-Harabasz Index (Variance Ratio Criterion) calculates the
ratio of between-cluster variance to within-cluster variance. Like the
Silhouette Score, the CH index can be used to find the optimal number of
clusters k in algorithms such as k-means, where the value of k is not
known a priori. Higher CH values indicate better-defined clusters.
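For reference, the three criteria can be written as follows. The notation is an assumption introduced here rather than taken from the text: $n$ points $x_i$ are split into clusters $C_1, \dots, C_k$ with sizes $n_j$ and centroids $\mu_j$; $\mu$ is the overall mean; $a(i)$ is the mean distance from point $i$ to the other members of its own cluster; and $b(i)$ is the smallest mean distance from point $i$ to the members of any other cluster.

```latex
\begin{align*}
s(i) &= \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i) \\
\mathrm{WCSS}(k) &= \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \\
\mathrm{CH}(k) &= \frac{\sum_{j=1}^{k} n_j \lVert \mu_j - \mu \rVert^2 \,/\, (k-1)}{\mathrm{WCSS}(k) \,/\, (n-k)}
\end{align*}
```

The silhouette average $\bar{s}$ and the CH index are maximized over $k$, while WCSS always decreases as $k$ grows, which is why the Elbow Method looks for a bend rather than a minimum.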
library(jpeg)
library(plotrix)
library(rasterImage)
library(imager)
library(ggplot2)
library(Rtsne)
library(cluster)
library(gridExtra)
library(fpc)
setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL")
mural<-readJPEG("diego_rivera.jpg")
class(mural)
## [1] "array"
Plot raster image.
plot(1, type="n")
rasterImage(mural, 0.6, 0.6, 1.4, 1.4)
Inspect the dimensions of the object.
dm1 <-dim(mural)
dm1
## [1] 675 1200 3
The image is a 675 x 1200 pixel RGB image. It has 675 rows and 1200 columns, making up a total of 810,000 pixels. Each pixel has 3 color values: red, green and blue.
To cluster the image and plot it, the JPEG array has to be converted into a data frame with one row per pixel, holding the pixel coordinates and its RGB values.
rgbMural<-data.frame(x=rep(1:dm1[2], each=dm1[1]),
y=rep(dm1[1]:1, dm1[2]),
r.value=as.vector(mural[,,1]),
g.value=as.vector(mural[,,2]),
b.value=as.vector(mural[,,3]))
head(rgbMural)
## x y r.value g.value b.value
## 1 1 675 0.9960784 1.0000000 0.9764706
## 2 1 674 1.0000000 0.9921569 0.9450980
## 3 1 673 0.9921569 0.9960784 0.9333333
## 4 1 672 1.0000000 0.9960784 0.9254902
## 5 1 671 0.9921569 1.0000000 0.9176471
## 6 1 670 0.9372549 0.8705882 0.7019608
plot(y~x,
data= rgbMural,
main="Diego Rivera mural",
col=rgb(rgbMural[c("r.value", "g.value", "b.value")]),
asp=1,
pch="."
)
Each row now represents a pixel with its coordinates (x, y) and
corresponding RGB color values.
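The index arithmetic in this conversion is easy to get wrong, so it can help to check it on a tiny array first. The sketch below uses a made-up 2 x 3 "image" (the name `toy` and its values are hypothetical) and confirms that the top-left pixel of the array ends up at the top-left of the plot.

```r
# Toy 2 x 3 "image": 2 rows, 3 columns, 3 colour channels (values are made up)
toy <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))
dm  <- dim(toy)

toy_df <- data.frame(
  x       = rep(1:dm[2], each = dm[1]),  # column index, repeated for every row
  y       = rep(dm[1]:1, dm[2]),         # row index flipped so row 1 is drawn at the top
  r.value = as.vector(toy[, , 1]),       # as.vector() reads the matrix column by column
  g.value = as.vector(toy[, , 2]),
  b.value = as.vector(toy[, , 3])
)

# The top-left pixel toy[1, 1, ] should land at x = 1, y = 2 (the top row of the plot)
top_left <- toy_df[toy_df$x == 1 & toy_df$y == dm[1], ]
top_left
```

The `y` column counts down because image rows are numbered from the top, while plot coordinates grow upwards.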
Determine the optimal number of
clusters. Start with Silhouette Score.
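# Average silhouette width for k = 1..10; it is undefined for k = 1, so that entry stays NA.
n1<-rep(NA_real_, 10)
for (i in 2:10) {
clS_mural<-clara(rgbMural[, c("r.value", "g.value", "b.value")], i)
n1[i]<-clS_mural$silinfo$avg.width
}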
plot(n1, type='l', main="Optimal number of clusters (Silhouette score)", xlab="Number of clusters", ylab="Average silhouette", col="blue")
points(n1, pch=21, bg="navyblue")
abline(h=(1:30)*5/100, lty=3, col="grey50")
clara_mural<-clara(rgbMural[,3:5], 4)
plot(silhouette(clara_mural))
Based on the Silhouette Method, the average silhouette width peaks at
4 clusters, showing that this is the configuration with the best-defined
clustering structure. Clusters 1, 3, and 4 have a higher silhouette
width, indicating better-defined groupings. However, cluster 2 has a low
silhouette width, implying that the points in this cluster may not be
well separated from points in other clusters. Overall, an average
silhouette width over 0.5 is reasonable, but the low value for cluster 2
suggests that an additional method should be applied.
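The per-cluster values read off the silhouette plot are also stored in the fitted object. The sketch below shows where to find them; the data are made up (three synthetic groups, two of them close together), and `clara()` reports silhouette information for its best subsample rather than for every observation.

```r
library(cluster)  # provides clara()

set.seed(1)
# Made-up data: one isolated group and two groups that lie close to each other
toy <- rbind(matrix(rnorm(90, 0.1, 0.03), ncol = 3),
             matrix(rnorm(90, 0.6, 0.03), ncol = 3),
             matrix(rnorm(90, 0.7, 0.03), ncol = 3))
cl <- clara(toy, k = 3)

cl$silinfo$clus.avg.widths  # one average silhouette width per cluster
cl$silinfo$avg.width        # overall average silhouette width
```

A cluster whose average width is clearly below the others is the numeric counterpart of the weak cluster visible in the plot.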
Elbow Method.
n2<-c()
for (h in 1:10) {
clE_mural<-clara(rgbMural[, c("r.value", "g.value", "b.value")], h)
n2[h]<-clE_mural$objective
}
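# clara()$objective is the mean dissimilarity of pixels to their medoid,
# used here as the within-cluster criterion for the elbow plot.
plot(n2, type = 'l', main = "Optimal Number of Clusters (Elbow Method)",
xlab = "Number of Clusters", ylab = "Mean dissimilarity to medoid", col = "blue")
points(n2, pch = 21, bg = "navyblue")
# Mark the index of the smallest second difference of the curve.
abline(v = which.min(diff(diff(n2))), lty = 3, col = "red")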
According to the Elbow Method, 4 clusters is also the optimal number,
because it is the point where adding more clusters no longer results in
a substantial reduction in within-cluster dissimilarity. Since both
criteria suggest the same number of clusters, this number is applied.
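One common way to turn the visual elbow into a number is to look for the sharpest bend in the curve, that is, the largest second difference. The sketch below is an illustration of that heuristic only: it assumes the objective values are stored in order for k = 1, 2, ..., and both the helper name `find_elbow` and the toy values are hypothetical, not results from this study.

```r
# Locate an elbow as the point of largest second difference.
# `objective` holds one value per k = 1, 2, ..., length(objective).
find_elbow <- function(objective) {
  second_diff <- diff(diff(objective))  # entry j describes the bend at k = j + 1
  which.max(second_diff) + 1
}

# Made-up curve: steep drop until k = 4, then nearly flat
toy_objective <- c(100, 70, 45, 20, 18, 16.5, 15.5, 14.5, 14, 13.5)
find_elbow(toy_objective)
```

On the toy curve the function returns 4, the last k before the curve flattens. On real curves the heuristic is sensitive to noise, so it is best read alongside the plot.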
Prepare the color representation of the clustered image.
coloursMural <-rgb(clara_mural$medoids[clara_mural$clustering, ])
Plot pixels in the new colours.
plot(rgbMural$y~rgbMural$x, col=coloursMural, pch=".", cex=2, asp=1, main="4 colours")
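The recolouring step relies on indexing the medoid table by the vector of cluster labels. A minimal sketch of that indexing, with a made-up medoid table and made-up labels for five pixels, is:

```r
# Hypothetical medoid table: 2 clusters, one RGB triple each
medoids <- matrix(c(0.9, 0.1, 0.1,    # cluster 1: reddish
                    0.1, 0.1, 0.9),   # cluster 2: bluish
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("r.value", "g.value", "b.value")))
clustering <- c(1, 2, 2, 1, 2)        # made-up cluster label for each of 5 pixels

# Indexing rows by the label vector repeats each medoid once per pixel
pixel_colours <- medoids[clustering, ]
hex <- rgb(pixel_colours)             # rgb() accepts a 3-column matrix with values in [0, 1]
hex
```

Every pixel receives the colour of its cluster's medoid, so the plotted image contains exactly as many colours as there are clusters.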
pierogi<-readJPEG("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL/pierogi.jpg")
class(pierogi)
## [1] "array"
Repeat the same steps as were described above.
plot(1, type="n")
rasterImage(pierogi, 0.6, 0.6, 1.4, 1.4)
dm2 <-dim(pierogi)
dm2
## [1] 874 954 3
The image is an 874 x 954 pixel RGB image. It has 874 rows and 954 columns, making up a total of 833,796 pixels. Each pixel has 3 color values: red, green and blue.
rgbPierogi<-data.frame(x=rep(1:dm2[2], each=dm2[1]),
y=rep(dm2[1]:1, dm2[2]),
r.value=as.vector(pierogi[,,1]),
g.value=as.vector(pierogi[,,2]),
b.value=as.vector(pierogi[,,3]))
head(rgbPierogi)
## x y r.value g.value b.value
## 1 1 874 0.3803922 0.2352941 0.1333333
## 2 1 873 0.3803922 0.2352941 0.1333333
## 3 1 872 0.3725490 0.2274510 0.1254902
## 4 1 871 0.3803922 0.2352941 0.1333333
## 5 1 870 0.3686275 0.2235294 0.1215686
## 6 1 869 0.3686275 0.2078431 0.1215686
plot(y~x,
data=rgbPierogi,
main="Pierogi",
col=rgb(rgbPierogi[c("r.value", "g.value", "b.value")]),
asp=1,
pch="."
)
Silhouette score.
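# Average silhouette width for k = 1..10; it is undefined for k = 1, so that entry stays NA.
n3<-rep(NA_real_, 10)
for (p in 2:10) {
clS_pierogi <-clara(rgbPierogi[, c("r.value", "g.value", "b.value")], p)
n3[p]<-clS_pierogi$silinfo$avg.width
}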
plot(n3, type='l', main="Optimal number of clusters (Silhouette score)", xlab="Number of clusters", ylab="Average silhouette", col="blue")
points(n3, pch=21, bg="navyblue")
abline(h=(1:30)*5/100, lty=3, col="grey50")
clara_pierogi2<-clara(rgbPierogi[,3:5], 2)
plot(silhouette(clara_pierogi2))
Based on the Silhouette Method, the average silhouette width peaks at
2 clusters, showing that this is the configuration with the best-defined
clustering structure. An average silhouette width over 0.7 is
reasonable. Check whether the results obtained using the Elbow Method
remain consistent.
Elbow Method.
n4<-c()
for (d in 1:10) {
clE_pierogi<-clara(rgbPierogi[, c("r.value", "g.value", "b.value")], d)
n4[d]<-clE_pierogi$objective
}
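# clara()$objective is the mean dissimilarity of pixels to their medoid,
# used here as the within-cluster criterion for the elbow plot.
plot(n4, type = 'l', main = "Optimal Number of Clusters (Elbow Method)",
xlab = "Number of Clusters", ylab = "Mean dissimilarity to medoid", col = "blue")
points(n4, pch = 21, bg = "navyblue")
# Mark the index of the smallest second difference of the curve.
abline(v = which.min(diff(diff(n4))), lty = 3, col = "red")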
Based on the Elbow Method, the optimal number of clusters for the
‘Pierogi’ image is 8. Because the Silhouette Score and the Elbow Method
disagree, the Calinski-Harabasz (CH) Index is used as an additional
metric to make the choice of the number of clusters more reliable.
Calinski-Harabasz index.
The Calinski-Harabasz index evaluates cluster quality by measuring the
ratio of between-cluster dispersion to within-cluster dispersion. To
address computational and memory limitations, a sampling approach was
applied: the code below builds a pairwise distance matrix, which is not
feasible for all 833,796 pixels. A sample of 10,000 observations is
expected to preserve the essential structure of the dataset, given the
dataset’s uniform patterns and consistent distribution of RGB values.
Fixing the random seed makes the sampling reproducible.
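set.seed(123)
# Draw 10,000 pixels at random; the sample keeps all five columns (x, y, r, g, b).
sampled_pierogi <- rgbPierogi[sample(1:nrow(rgbPierogi), size = 10000), ]
ch_indices_pierogi <- sapply(2:10, function(k) {
# Both clara() and dist() see the pixel coordinates as well as the colours here;
# subset to columns 3:5 to score colour-only clusterings.
clara_pierogich <- clara(sampled_pierogi, k)
cluster.stats(dist(sampled_pierogi), clara_pierogich$clustering)$ch
})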
ch_indices_pierogi
## [1] 5233.343 7281.453 7156.269 9114.986 8814.928 8870.446 9158.971 9087.849
## [9] 9197.451
plot(2:10, ch_indices_pierogi, type = "b",
xlab = "Number of Clusters (k)", ylab = "Calinski-Harabasz Index",
main = "Calinski-Harabasz Index for CLARA Clustering",
col = "navyblue", pch = 16)
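For reference, the index can also be computed directly from its definition in base R, without a distance matrix. This is a sketch under stated assumptions, not the `fpc` implementation used above: the helper name `ch_index` is hypothetical, the data are made up, and only colour columns are passed in, on the assumption that colour similarity is what the clustering is meant to capture.

```r
# Calinski-Harabasz index from its definition, for a numeric matrix `x`
# and a vector `labels` holding one cluster label per row.
ch_index <- function(x, labels) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(labels))
  overall <- colMeans(x)
  between <- 0
  within  <- 0
  for (g in unique(labels)) {
    members <- x[labels == g, , drop = FALSE]
    centre  <- colMeans(members)
    between <- between + nrow(members) * sum((centre - overall)^2)
    within  <- within + sum(sweep(members, 2, centre)^2)  # squared distances to the centre
  }
  (between / (k - 1)) / (within / (n - k))
}

set.seed(123)
# Made-up colour data: two tight groups in RGB space
toy_rgb <- rbind(matrix(runif(150, 0.0, 0.2), ncol = 3),
                 matrix(runif(150, 0.8, 1.0), ncol = 3))
good_labels <- rep(1:2, each = 50)    # labels that follow the two groups
bad_labels  <- rep(1:2, times = 50)   # alternating labels that ignore the structure

ch_index(toy_rgb, good_labels)
ch_index(toy_rgb, bad_labels)
```

Labels that follow the real groups give a far larger index than labels that cut across them, which is the behaviour the CH criterion relies on when comparing values of k.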
The highest CH index value, 9197.451, occurs at 10 clusters, a count
that neither of the other methods suggests. The next highest value,
9158.971, occurs at 8 clusters and is only marginally lower.
Because the two CH values are so close and the Elbow Method also
identifies 8 clusters as optimal, 8 clusters are selected as the most
appropriate number for this image. This decision favors consistency
between methods.
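How close the candidates are can be made explicit by expressing each CH value as a shortfall relative to the maximum. The sketch below reuses the nine CH values printed above for k = 2, ..., 10; the variable names are illustrative.

```r
# CH values reported above for k = 2, ..., 10
ch <- c(5233.343, 7281.453, 7156.269, 9114.986, 8814.928,
        8870.446, 9158.971, 9087.849, 9197.451)
names(ch) <- 2:10

best    <- max(ch)
rel_gap <- (best - ch) / best          # shortfall of each k relative to the maximum
round(100 * rel_gap[c("8", "10")], 2)  # percentage gap for k = 8 and k = 10
```

The value at k = 8 falls short of the maximum by roughly 0.4%, which supports treating 8 and 10 clusters as practically tied on this criterion.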
clara_pierogi8<-clara(rgbPierogi[,3:5], 8)
coloursPierogi8<-rgb(clara_pierogi8$medoids[clara_pierogi8$clustering, ])
plot(rgbPierogi$y~rgbPierogi$x, col=coloursPierogi8, pch=".", cex=2, asp=1, main="8 colours")
setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL")
gdansk_sea <-readJPEG("Gdansk_Sea.jpg")
class(gdansk_sea)
## [1] "array"
Repeat the same steps as were described above.
plot(1, type="n")
rasterImage(gdansk_sea, 0.6, 0.6, 1.4, 1.4)
dm3 <-dim(gdansk_sea)
dm3
## [1] 675 1200 3
The image is a 675 x 1200 pixel RGB image. It has 675 rows and 1200 columns, making up a total of 810,000 pixels. Each pixel has 3 color values: red, green and blue.
rgbGdansk<-data.frame(x=rep(1:dm3[2], each=dm3[1]),
y=rep(dm3[1]:1, dm3[2]),
r.value=as.vector(gdansk_sea[,,1]),
g.value=as.vector(gdansk_sea[,,2]),
b.value=as.vector(gdansk_sea[,,3]))
head(rgbGdansk)
## x y r.value g.value b.value
## 1 1 675 0.8470588 0.8980392 0.9725490
## 2 1 674 0.8470588 0.8980392 0.9725490
## 3 1 673 0.8509804 0.9019608 0.9764706
## 4 1 672 0.8549020 0.9058824 0.9803922
## 5 1 671 0.8588235 0.9098039 0.9764706
## 6 1 670 0.8627451 0.9137255 0.9803922
plot(y~x,
data=rgbGdansk,
main="Gdansk_BalticSea",
col=rgb(rgbGdansk[c("r.value", "g.value", "b.value")]),
asp=1,
pch="."
)
Silhouette score.
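# Average silhouette width for k = 1..10; it is undefined for k = 1, so that entry stays NA.
n5<-rep(NA_real_, 10)
for (g in 2:10) {
clS_Gdansk<-clara(rgbGdansk[, c("r.value", "g.value", "b.value")], g)
n5[g]<-clS_Gdansk$silinfo$avg.width
}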
plot(n5, type='l', main="Optimal number of clusters (Silhouette score)", xlab="Number of clusters", ylab="Average silhouette", col="blue")
points(n5, pch=21, bg="navyblue")
abline(h=(1:30)*5/100, lty=3, col="grey50")
claraS_Gdansk <-clara(rgbGdansk[,3:5], 2)
plot(silhouette(claraS_Gdansk))
Based on the Silhouette Method, the average silhouette width peaks at
2 clusters. A silhouette width of 0.6 is reasonable. Check whether the
results obtained using the Elbow Method remain consistent.
Elbow Method.
n6<-c()
for (s in 1:10) {
clE_Gdansk<-clara(rgbGdansk[, c("r.value", "g.value", "b.value")], s)
n6[s]<-clE_Gdansk$objective
}
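# clara()$objective is the mean dissimilarity of pixels to their medoid,
# used here as the within-cluster criterion for the elbow plot.
plot(n6, type = 'l', main = "Optimal Number of Clusters (Elbow Method)",
xlab = "Number of Clusters", ylab = "Mean dissimilarity to medoid", col = "blue")
points(n6, pch = 21, bg = "navyblue")
# Mark the index of the smallest second difference of this image's curve (n6).
abline(v = which.min(diff(diff(n6))), lty = 3, col = "red")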
As with the previous image, the Elbow Method indicates that 8
clusters are optimal. To ensure a comprehensive evaluation, apply the CH
index.
Calinski-Harabasz index.
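set.seed(123)
# Draw 10,000 pixels at random; the sample keeps all five columns (x, y, r, g, b).
sampled_Gdansk <- rgbGdansk[sample(1:nrow(rgbGdansk), size = 10000), ]
ch_indices_Gdansk <- sapply(2:10, function(v) {
# Both clara() and dist() see the pixel coordinates as well as the colours here;
# subset to columns 3:5 to score colour-only clusterings.
clara_GdanskCh <- clara(sampled_Gdansk, v)
cluster.stats(dist(sampled_Gdansk), clara_GdanskCh$clustering)$ch
})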
ch_indices_Gdansk
## [1] 13044.503 10133.821 9690.864 8753.373 10943.635 10493.068 10325.053
## [8] 9619.297 9994.434
plot(2:10, ch_indices_Gdansk, type = "b",
xlab = "Number of Clusters (k)", ylab = "Calinski-Harabasz Index",
main = "Calinski-Harabasz Index for CLARA Clustering",
col = "navyblue", pch = 16)
The highest CH index (13044.503) indicates that the optimal number
of clusters for the Gdansk image is 2. This conclusion aligns with the
recommendation from the Silhouette Score, which also suggests that 2
clusters provide the best balance between cluster cohesion and
separation. Based on the results from these two methods, a 2-cluster
configuration is applied to the image.
coloursGdansk<-rgb(claraS_Gdansk$medoids[claraS_Gdansk$clustering, ])
plot(rgbGdansk$y~rgbGdansk$x, col=coloursGdansk, pch=".", cex=2, asp=1, main="2 colours")
The study shows that the Elbow Method tended to suggest higher numbers
of clusters, particularly for images with gradual transitions or
intricate details (the “Pierogi” and “Baltic Sea” images). This is
because the Elbow Method focuses on within-cluster variance (WCSS), and
adding more clusters inherently reduces this value. For images like the
“Baltic Sea”, dominated by gradients and fewer distinct regions, simpler
clustering (2 clusters) provides a more meaningful segmentation. The
Elbow Method’s suggestion of 8 clusters for this image reflects its
tendency to over-segment gradient-heavy datasets, capturing
insignificant pixel variations rather than the broader structure of the
image.
Images like the Diego Rivera mural, which feature intricate patterns and
diverse color palettes, require more clusters to accurately represent
their complexity. Both the Elbow Method and Silhouette Score identified
4 clusters as optimal, demonstrating their ability to capture the
distinct regions and details in such images.
Agreement between the Silhouette Score and the CH Index indicates strong clustering quality,
as both metrics consider aspects of cluster separation and compactness.
When methods disagree, their differences provide insights into the
nature of the data. For example, the disagreement in the “Pierogi” image
highlights a trade-off between simplicity (fewer clusters, Silhouette
Score) and granularity (more clusters, CH index).
In summary, the choice of evaluation metrics can be guided by the
characteristics of the image:
- Silhouette Score and Calinski-Harabasz Index for images with color gradients;
- Elbow Method and Calinski-Harabasz Index for images with distinct objects that vary in size, shape, or position but have fewer distinct colors;
- Silhouette Score, Elbow Method, and Calinski-Harabasz Index for images with patterns, diverse colors, and detailed structure.