Unsupervised Learning - Image Clustering

Data preprocessing

Loading needed libraries

First, the required libraries need to be loaded.

library(grid)
library(jpeg)
library(rasterImage)
library(cluster)
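The code below uses two image arrays, colorwheel and rainbow, but the step that loads them is not shown. A minimal sketch of that step, assuming the pictures are JPEG files in the working directory. The file names colorwheel.jpg and rainbow.jpg and the helper load_image are hypothetical, since the real paths are not given here.

```r
library(jpeg)

# Hypothetical helper: read a JPEG into a height x width x channels array
# with values in [0, 1]; returns NULL (with a message) if the file is missing.
load_image <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  readJPEG(path)
}

# Assumed file names, adjust them to the actual location of the pictures
colorwheel <- load_image("colorwheel.jpg")
rainbow <- load_image("rainbow.jpg")
```

Note that a variable called rainbow coexists with the base R function rainbow(), because R looks up functions and ordinary objects separately when a name is used in a call.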

Data dimensions

Then let's check the dimensions of our data (the pictures Colorwheel and Rainbow).

# Picture 1. COLOURWHEEL
# Get the image dimensions: rows (height), columns (width), colour channels
dm_colorwheel<-dim(colorwheel)
dm_colorwheel

## [1] 1920 1920    3

It should be interpreted as follows:

The image has a height of 1920 pixels (rows).
The image has a width of 1920 pixels (columns).
The image has three color channels (RGB - Red, Green, Blue) for each pixel.

# Picture 2. RAINBOW
# Get the image dimensions: rows (height), columns (width), colour channels
dm_rainbow<-dim(rainbow)
dm_rainbow

## [1]  851 1280    3

Analogously, for Rainbow:

The image has a height of 851 pixels.
The image has a width of 1280 pixels.
The image has three color channels for each pixel.
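Because the first two dimensions are rows and columns, their product gives the number of pixels, which is also the number of observations to be clustered. A quick check, with the dimensions printed above hard-coded so that the snippet runs on its own:

```r
# Dimensions as printed above: rows (height), columns (width), channels
dm_colorwheel <- c(1920, 1920, 3)
dm_rainbow <- c(851, 1280, 3)

# Number of pixels = height * width
n_colorwheel <- prod(dm_colorwheel[1:2])
n_rainbow <- prod(dm_rainbow[1:2])

n_colorwheel  # 3686400
n_rainbow     # 1089280
```

With more than a million pixels per picture, a method that needs all pairwise dissimilarities at once would be impractical, which is why a sampling-based method is a natural fit here.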

Converting images to RGB colors

Now let's transform the pictures into data.frame objects with one row per pixel (its coordinates and RGB values), and then plot each of them.

rgb_colorwheel<-data.frame(
  x=rep(1:dm_colorwheel[2], each=dm_colorwheel[1]),
  y=rep(dm_colorwheel[1]:1, dm_colorwheel[2]),
  r.value=as.vector(colorwheel[,,1]),
  g.value=as.vector(colorwheel[,,2]),
  b.value=as.vector(colorwheel[,,3]))

plot(y~x, data=rgb_colorwheel, main="Colorwheel RGB", col=rgb(rgb_colorwheel[c("r.value", "g.value", "b.value")]), asp=1, pch=".")

rgb_rainbow<-data.frame(
  x=rep(1:dm_rainbow[2], each=dm_rainbow[1]),
  y=rep(dm_rainbow[1]:1, dm_rainbow[2]),
  r.value=as.vector(rainbow[,,1]),
  g.value=as.vector(rainbow[,,2]),
  b.value=as.vector(rainbow[,,3]))

plot(y~x, data=rgb_rainbow, main="Rainbow RGB", col=rgb(rgb_rainbow[c("r.value", "g.value", "b.value")]), asp=1, pch=".")
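To see what this conversion does, here is the same construction wrapped in a helper and applied to a tiny toy image of 2 rows by 3 columns. The name image_to_df and the toy array are illustrative assumptions, not part of the original code. The point is that as.vector() walks the array column by column, so x repeats each column index once per row, and y counts down so that the top image row gets the largest y value.

```r
# Hypothetical helper mirroring the data.frame construction above
image_to_df <- function(img) {
  dm <- dim(img)
  data.frame(
    x = rep(1:dm[2], each = dm[1]),   # column index, repeated for every row
    y = rep(dm[1]:1, dm[2]),          # row index, flipped so row 1 is on top
    r.value = as.vector(img[, , 1]),
    g.value = as.vector(img[, , 2]),
    b.value = as.vector(img[, , 3])
  )
}

# Toy image: 2 rows, 3 columns, all black except a red top-left pixel
toy <- array(0, dim = c(2, 3, 3))
toy[1, 1, 1] <- 1

toy_df <- image_to_df(toy)
toy_df   # 6 rows; the first row is the red pixel at x = 1, y = 2
```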

Apply CLARA and get Silhouette

CLARA conducts PAM (partitioning around medoids) on various samples drawn from the dataset and selects the best clustering result as the output. The benefit of CLARA lies in its ability to handle larger datasets compared to PAM.
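A small illustration of the idea on made-up data, not on the pictures themselves. The toy matrix, the seed and the chosen values of samples and sampsize are assumptions for the sake of the example. The arguments samples and sampsize of clara() control how many sub-samples are drawn and how large each one is.

```r
library(cluster)

set.seed(1)
# Toy "pixels": three tight groups of colours in RGB space
centres <- rbind(c(0.9, 0.1, 0.1), c(0.1, 0.9, 0.1), c(0.1, 0.1, 0.9))
toy <- centres[rep(1:3, each = 200), ] +
  matrix(rnorm(600 * 3, sd = 0.02), ncol = 3)
toy <- pmin(pmax(toy, 0), 1)   # keep values inside [0, 1]
colnames(toy) <- c("r.value", "g.value", "b.value")

# PAM is run on 10 sub-samples of 100 observations each; the best set of
# medoids is then used to label all 600 observations
fit <- clara(toy, k = 3, samples = 10, sampsize = 100)

fit$medoids            # one representative observation per cluster
table(fit$clustering)  # cluster sizes over the full data set
```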

The Silhouette Index is a metric used to measure how well-defined and separated clusters are in a clustering analysis. It quantifies the similarity of an object to its own cluster (cohesion) compared to other clusters (separation).

The index ranges from -1 to 1, where:

A high positive value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters, suggesting a well-defined cluster.
A value near 0 suggests overlapping clusters or ambiguous assignments.
A negative value indicates that the object is likely placed in the wrong cluster.

In the context of cluster validation, a higher average Silhouette Index across all data points often indicates a more appropriate number of clusters.
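The definition can be checked by hand on a toy example. For a point i, let a be its mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; the silhouette is then (b - a) / max(a, b). The sketch below uses four made-up points on a line and compares the manual result with silhouette() from the cluster package.

```r
library(cluster)

# Four points on a line, forming two obvious clusters
x <- c(0, 1, 10, 11)
cl <- c(1L, 1L, 2L, 2L)
d <- as.matrix(dist(x))

# Manual silhouette: s(i) = (b - a) / max(a, b)
manual_sil <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # cohesion
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # separation
  (b - a) / max(a, b)
})

round(manual_sil, 3)   # 0.905 0.895 0.895 0.905

# The same values computed by the cluster package
pkg_sil <- silhouette(cl, dist(x))[, "sil_width"]
```

For the point at 0, a = 1 and b = (10 + 11) / 2 = 10.5, so the silhouette is 9.5 / 10.5, roughly 0.905.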

With the data ready for clustering, the crucial step is to choose an appropriate number of clusters. A popular method for making this choice is the Silhouette Index. Therefore, we check the index value for up to 10 clusters (the index is only defined for two or more clusters). We do not consider a larger number of clusters to avoid overfitting the data. Our goal is to have similar objects grouped in the same cluster, and different objects placed in different clusters.

- Picture 1. Colourwheel -

# Average silhouette width per number of clusters (NA where undefined)
n1<-rep(NA_real_, 10)
# The silhouette needs at least 2 clusters, so start from 2
for (i in 2:10) {
  cl<-clara(rgb_colorwheel[, c("r.value", "g.value", "b.value")], i)
  # save the average silhouette width for i clusters
  n1[i]<-cl$silinfo$avg.width
}

plot(n1, type='l', main="Optimal number of clusters for Colourwheel", xlab="Number of clusters", ylab="Average silhouette", col="blue")
points(n1, pch=21, bg="navyblue")
abline(h=(1:30)*5/100, lty=3, col="grey50")
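The selection rule read off the plot (take the number of clusters with the highest average silhouette) can also be written down directly. A minimal sketch on made-up data with three clear colour groups. The helper avg_silhouette and the toy matrix are assumptions for illustration, and the search is limited to 2 to 6 clusters to keep it quick.

```r
library(cluster)

set.seed(1)
# Toy "pixels": three tight groups of colours in RGB space
centres <- rbind(c(0.9, 0.1, 0.1), c(0.1, 0.9, 0.1), c(0.1, 0.1, 0.9))
toy <- centres[rep(1:3, each = 100), ] +
  matrix(rnorm(300 * 3, sd = 0.02), ncol = 3)

# Hypothetical helper: average silhouette width for each k in ks
avg_silhouette <- function(x, ks) {
  sapply(ks, function(k) clara(x, k)$silinfo$avg.width)
}

ks <- 2:6
widths <- avg_silhouette(toy, ks)

# Number of clusters with the highest average silhouette
best_k <- ks[which.max(widths)]
best_k
```

which.max() returns the first maximum, so ties go to the smaller number of clusters. When several values are nearly equal, as in the plot above, looking at the curve remains a useful complement to the automatic pick.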

The graph above indicates that 6 clusters should be selected, as this gives the highest silhouette index. The values for 9 and 10 clusters are also high, but very similar to the value for 6 clusters, so there is little reason to prefer them. It is also helpful to check the index value for each cluster.

clara1<-clara(rgb_colorwheel[,3:5], 6) 
plot(silhouette(clara1))

As the graph above shows, the silhouette of the individual clusters ranges from 0.49 to 1.00, a fairly good fit, with an overall silhouette index of 0.76. With that choice made, let's see the visualization of the 6 clusters found by the CLARA method.

colours<-rgb(clara1$medoids[clara1$clustering, ])
# plot pixels in the new colours
plot(rgb_colorwheel$y~rgb_colorwheel$x, col=colours, pch=".", cex=2, asp=1, main="6 colours")
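The recolouring relies on one indexing step: selecting rows of the medoid matrix by cluster label repeats each medoid once for every pixel assigned to it, and rgb() then turns each row into a colour code. A toy illustration with made-up medoids and labels:

```r
# Two made-up medoids (rows) with r, g, b values in [0, 1]
medoids <- rbind(c(1, 0, 0),   # cluster 1: pure red
                 c(0, 0, 1))   # cluster 2: pure blue

# Made-up cluster labels for four pixels
clustering <- c(1, 2, 2, 1)

# One medoid row per pixel, converted to hex colour codes
pixel_colours <- rgb(medoids[clustering, ])
pixel_colours   # "#FF0000" "#0000FF" "#0000FF" "#FF0000"
```

The picture drawn this way therefore contains at most as many distinct colours as there are clusters.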

Note that because Colorwheel is a JPG rather than a PNG file, it has no transparency: the white background is part of the image and forms one of the clusters.

The result is quite accurate, but looking at the original picture I would suggest adding one more cluster to divide the wheel into 6 parts (even though this is not what the silhouette score suggests).

So let's find out how it looks with 7 clusters:
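The code for this step is not shown in the text. A hedged sketch of what it could look like, mirroring the 6-cluster code above. The helper cluster_and_plot and the name clara7 are hypothetical, and the helper assumes a data frame with the columns x, y, r.value, g.value and b.value as built earlier.

```r
library(cluster)

# Hypothetical helper (not from the original code): cluster the RGB columns
# with CLARA, redraw every pixel in its medoid colour, and return the fit.
cluster_and_plot <- function(pixels, k, title) {
  fit <- clara(pixels[, c("r.value", "g.value", "b.value")], k)
  colours <- rgb(fit$medoids[fit$clustering, , drop = FALSE])
  plot(pixels$y ~ pixels$x, col = colours, pch = ".", cex = 2, asp = 1,
       main = title)
  invisible(fit)
}

# Assumed usage with the data frame built earlier:
# clara7 <- cluster_and_plot(rgb_colorwheel, 7, "7 colours")
# clara7$silinfo$avg.width   # to compare with the 6-cluster fit
```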

I must say, I like that output more than the previous one. However, one should remember that in this case the fit of points to clusters is slightly worse according to the silhouette index.

I find that clustering went quite well in this example (even better than I expected). Looking at the visualization of the assigned clusters in the Colorwheel image, it is quite clear what color range was present in the original picture. So let's move on to the second picture, which will probably be more challenging.

- Picture 2. Rainbow -

# Average silhouette width per number of clusters (NA where undefined)
n2<-rep(NA_real_, 10)
# The silhouette needs at least 2 clusters, so start from 2
for (i in 2:10) {
  c2<-clara(rgb_rainbow[, c("r.value", "g.value", "b.value")], i)
  # save the average silhouette width for i clusters
  n2[i]<-c2$silinfo$avg.width
}

plot(n2, type='l', main="Optimal number of clusters for Rainbow", xlab="Number of clusters", ylab="Average silhouette", col="blue")
points(n2, pch=21, bg="navyblue")
abline(h=(1:30)*5/100, lty=3, col="grey50")

The graph above suggests two possible choices for the number of clusters: 4 or 7. So let's examine the clustering output in those two cases.

The plots below present silhouette index values for each cluster in both cases. The overall silhouette is very similar for 4 and for 7 clusters, approximately 0.6, which is quite good. In the first scenario, the silhouette of the individual clusters falls in the range 0.58-0.68, which means all clusters have similar and fairly high quality. However, with 7 clusters, some of them are poorly fitted: the silhouette index varies from 0.29 (!) to 0.94, so 7 clusters seems a worse choice compared with 4.

# Note: this overwrites clara1 from the Colourwheel part
clara1<-clara(rgb_rainbow[,3:5], 4)
plot(silhouette(clara1))

clara2<-clara(rgb_rainbow[,3:5], 7) 
plot(silhouette(clara2))
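The per-cluster ranges quoted above can also be read from the fitted object instead of the plot: a clara fit stores the average silhouette width of every cluster in silinfo$clus.avg.widths. A sketch on made-up data. The helper silhouette_summary and the toy matrix are assumptions, and the numbers it produces are unrelated to the Rainbow results.

```r
library(cluster)

set.seed(1)
# Toy data: two tight colour groups and one diffuse group
toy <- rbind(
  matrix(rnorm(300, mean = 0.2, sd = 0.02), ncol = 3),
  matrix(rnorm(300, mean = 0.8, sd = 0.02), ncol = 3),
  matrix(runif(300, min = 0.3, max = 0.7), ncol = 3)
)

# Hypothetical helper: per-cluster and overall silhouette widths for one k
silhouette_summary <- function(x, k) {
  info <- clara(x, k)$silinfo
  list(per_cluster = info$clus.avg.widths,
       range = range(info$clus.avg.widths),
       overall = info$avg.width)
}

s3 <- silhouette_summary(toy, 3)
s6 <- silhouette_summary(toy, 6)

s3$range; s3$overall
s6$range; s6$overall
```

A wide range with a low minimum, as reported for the 7-cluster solution, signals that at least one cluster is poorly separated even when the overall average looks acceptable. These values are based on the best sub-sample that CLARA drew, not on all pixels.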

The next step is the graphical visualization of both solutions. Below, the Rainbow image is presented after choosing 4 and then 7 clusters, respectively.

colours<-rgb(clara1$medoids[clara1$clustering, ])
# plot pixels in the new colours
plot(rgb_rainbow$y~rgb_rainbow$x, col=colours, pch=".", cex=2, asp=1, main="4 colours")

colours<-rgb(clara2$medoids[clara2$clustering, ])
# plot pixels in the new colours
plot(rgb_rainbow$y~rgb_rainbow$x, col=colours, pch=".", cex=2, asp=1, main="7 colours")

As can be observed, for this image CLARA clustering, with the number of clusters chosen from the silhouette index, did not handle the colorful rainbow well. Even increasing the number of clusters from 4 to 7 did not significantly improve the clarity of the image. The area of the photo containing the rainbow is represented by only one or two colors, which does not correspond to the number of different colors actually present in it and perceived by the human eye.

Although the results obtained for the Colourwheel surprised me very positively, the results for Rainbow leave much to be desired. Perhaps the methods I used are not suitable for such colorful objects. Maybe a clearer picture with more contrast, showing only the rainbow without additional objects around it, would lead to a better result.

Nevertheless, I believe that the selected images allowed me to obtain interesting results (although the clustering results are not fully satisfying). Seeing other examples of image clustering, both during classes and on the Internet, I had the impression that it is very easy for any picture, and that the results always come out clear and well matched to the data. This task helped me realise that that's not always the case.