###Image Clustering
##1 Introduction
In this paper I show how unsupervised learning methods can be used for image clustering. People on social media like to post their pictures, and they sometimes like to cartoonize or blur them. The idea is to cluster the pixels of an image by colour and find its dominant colours, so that the picture can be represented with fewer colours. I compare five photos that I took myself: “Mountain”, “House”, “Beach”, “City” and “Street”.
##2 Data preprocessing
The first step is to open the images. Reading a .jpg file requires the jpeg package, which provides readJPEG().
library(jpeg)
# “Mountain”, “House”, "Beach", "City" and "Street"
image1 <- readJPEG("IMG_4982.JPG")
image4 <- readJPEG("IMG_1721.JPG")
Second, dim() returns the size of each image array. Its first two elements are the height and width of the image in pixels.
dm1 <- dim(image1);dm1[1:2]
## [1] 3088 2316
dm4 <- dim(image4);dm4[1:2]
## [1] 2944 2208
The “Mountain”, “House”, “Beach” and “Street” images are 3088 x 2316 pixels (height x width). The “City” image is smaller: as the output above shows, it is 2944 x 2208 pixels.
Third, we convert each image array into a data frame with one row per pixel, holding the pixel's x and y coordinates and its red, green and blue values, so that the image can be plotted.
rgbImage1 <- data.frame(
x=rep(1:dm1[2], each=dm1[1]),
y=rep(dm1[1]:1, dm1[2]),
r.value=as.vector(image1[,,1]),
g.value=as.vector(image1[,,2]),
b.value=as.vector(image1[,,3]))
rgbImage4 <- data.frame(
x=rep(1:dm4[2], each=dm4[1]),
y=rep(dm4[1]:1, dm4[2]),
r.value=as.vector(image4[,,1]),
g.value=as.vector(image4[,,2]),
b.value=as.vector(image4[,,3]))
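To make the coordinate bookkeeping easier to follow, here is a minimal sketch on a tiny made-up image. The 2 x 3 array `img` and the name `px` are hypothetical. It assumes the same layout that readJPEG() returns, namely rows x columns x channels with values between 0 and 1.

```r
# Hypothetical 2 x 3 RGB image: rows x columns x channels, values in [0, 1]
img <- array(0, dim = c(2, 3, 3))
img[1, 1, 1] <- 1   # top-left pixel is pure red
img[2, 3, 3] <- 1   # bottom-right pixel is pure blue

dm <- dim(img)
px <- data.frame(
  x = rep(1:dm[2], each = dm[1]),  # column index, repeated for every row
  y = rep(dm[1]:1, dm[2]),         # row index flipped, so row 1 ends up at the top
  r.value = as.vector(img[, , 1]), # as.vector() walks the matrix column by column
  g.value = as.vector(img[, , 2]),
  b.value = as.vector(img[, , 3])
)
print(px)
```

Because as.vector() reads a matrix column by column, x has to change slowly (each = number of rows) and y has to cycle quickly. Reversing y puts the first image row at the top of the plot.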
As the last preprocessing step, we plot the original images from these data frames.
plot(y ~ x, data=rgbImage1, main="Mountain",
col = rgb(rgbImage1[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
plot(y ~ x, data=rgbImage4, main="City",
col = rgb(rgbImage4[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
##3 Clustering
#3.1 Optimal number of k-clusters
First, we load the cluster package with library(cluster). The key question is then how many clusters to use.
The data sets are very large (an image of 3088 x 2316 pixels with 3 colour channels has more than 7 million pixels), so we use the Clara algorithm for image clustering. Clara is based on the k-medoids PAM algorithm but runs it on sub-samples of the data, which makes it suitable for large data sets.
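As a rough illustration of how clara() behaves, the sketch below runs it on made-up data rather than on the photos. The matrix `pixels`, the three centres and the chosen `samples` and `sampsize` values are assumptions for this example only.

```r
library(cluster)

set.seed(42)  # clara() draws random sub-samples, so fix the seed for repeatability

# Hypothetical stand-in for pixel colours: 3000 points around three RGB centres
centres <- rbind(c(0.1, 0.1, 0.1), c(0.5, 0.7, 0.9), c(0.9, 0.8, 0.2))
pixels <- centres[sample(1:3, 3000, replace = TRUE), ] +
  matrix(rnorm(3000 * 3, sd = 0.03), ncol = 3)
colnames(pixels) <- c("r.value", "g.value", "b.value")

# samples = number of sub-samples drawn, sampsize = size of each sub-sample
fit <- clara(pixels, k = 3, samples = 10, sampsize = 200)

fit$medoids             # one representative (actually observed) point per cluster
table(fit$clustering)   # cluster sizes over all 3000 points
fit$silinfo$avg.width   # average silhouette width, computed on the best sub-sample
```

Since the sub-samples are random, two runs without set.seed() can return slightly different medoids and silhouette widths.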
We find the optimal number of clusters k for each image by comparing the average silhouette width for every k. The silhouette ranges from -1 to 1. A positive value typically means that the elements are assigned to the right cluster, and the higher the value, the better the clustering.
Since our goal is to reduce the number of colours in the images, the ideal choice is the k with the highest average silhouette width. If that k is large (6 to 10), we cluster with it first and then cluster again with a default of 3 clusters. If it is in the range 2 to 5, there is no need to cluster again.
We use the cluster library to run the clara algorithm for k = 2 to 10 and record the average silhouette width for each k.
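For reference, the silhouette of a single observation can be written out explicitly. This is the standard textbook definition, given here as background. The symbols $a(i)$, $b(i)$ and $s(i)$ are notation introduced only for this illustration.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Here $a(i)$ is the mean dissimilarity between observation $i$ and the other members of its own cluster, and $b(i)$ is the smallest mean dissimilarity between $i$ and the members of any other cluster. The average silhouette width is the mean of $s(i)$ over the observations.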
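# Mountain: chosen k = 9
library(cluster)
n1 <- c()  # n1[k] holds the average silhouette width for k clusters; n1[1] stays NA
for (i in 2:10) {
# rgbImage1 still contains the x and y columns, so pixel position enters the distance
cl1 <- clara(rgbImage1, i)
print(paste('Average silhouette for',i,'clusters:',cl1$silinfo$avg.width))
n1[i] <- cl1$silinfo$avg.width
}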
## [1] "Average silhouette for 2 clusters: 0.450381292035369"
## [1] "Average silhouette for 3 clusters: 0.41343672659383"
## [1] "Average silhouette for 4 clusters: 0.410918153117001"
## [1] "Average silhouette for 5 clusters: 0.450526971525648"
## [1] "Average silhouette for 6 clusters: 0.410930930328657"
## [1] "Average silhouette for 7 clusters: 0.395629548763213"
## [1] "Average silhouette for 8 clusters: 0.437905447729368"
## [1] "Average silhouette for 9 clusters: 0.453412244662813"
## [1] "Average silhouette for 10 clusters: 0.4366691587312"
plot(n1, type = 'l',
main = "Mountain",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
# City: chosen k = 6
n4 <- c()
for (i in 2:10) {
cl4 <- clara(rgbImage4, i)
print(paste('Average silhouette for',i,'clusters:',cl4$silinfo$avg.width))
n4[i] <- cl4$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.437445923816954"
## [1] "Average silhouette for 3 clusters: 0.430596750321306"
## [1] "Average silhouette for 4 clusters: 0.521382325059271"
## [1] "Average silhouette for 5 clusters: 0.490634403200633"
## [1] "Average silhouette for 6 clusters: 0.481469631961961"
## [1] "Average silhouette for 7 clusters: 0.441929146142101"
## [1] "Average silhouette for 8 clusters: 0.505888419816954"
## [1] "Average silhouette for 9 clusters: 0.50992734468438"
## [1] "Average silhouette for 10 clusters: 0.484028369679495"
plot(n4, type = 'l',
main = "City",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
Chosen number of clusters: Mountain: 9, House: 9, Beach: 9, City: 6, Street: 4. Since there are 3 basic colours (red, green and blue) and the average silhouette width is around 0.5, I choose 9 clusters for the first three pictures and 6 for City, in order to make the colour analysis more diverse, and then cluster them a second time. Note that in the City output above the highest average silhouette width is at k = 4 (0.521); k = 6 (0.481) is kept for the sake of colour diversity. For the last picture, Street, I choose 4 clusters.
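The k with the highest average silhouette width can also be read off programmatically. The sketch below rebuilds the vector `n1` by hand from the Mountain output printed above, rounded to four decimals; `best_k` is a name introduced for this example.

```r
# Average silhouette widths for the Mountain image, copied from the output above;
# position 1 is NA because the loop starts at k = 2
n1 <- c(NA, 0.4504, 0.4134, 0.4109, 0.4505, 0.4109, 0.3956, 0.4379, 0.4534, 0.4367)

best_k <- which.max(n1)   # which.max() skips NA; the index equals k here
best_k

# The three best candidates, to see how close the runners-up are
head(order(n1, decreasing = TRUE), 3)
```

For this run the top three values (k = 9, 5 and 2) differ only in the third decimal. Because clara() works on random sub-samples, the ranking could change from run to run.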
#3.2 Running Clara algorithm
In this step we run the Clara algorithm with the chosen number of clusters, using only the three colour columns. We then plot the clustered image with the rgb() function, which creates colours from RGB values, so that every pixel is drawn in the colour of its cluster medoid.
“Mountain” image:
mountain = rgbImage1[, c("r.value", "g.value", "b.value")]
clara1 <- clara(mountain, 9)
plot(silhouette(clara1))
colours1 <- rgb(clara1$medoids[clara1$clustering, ])
plot(y ~ x, data=rgbImage1, main="Mountain",
col = colours1,
asp = 1, pch = ".")
“City” image:
city = rgbImage4[, c("r.value", "g.value", "b.value")]
clara4 <- clara(city, 6)
plot(silhouette(clara4))
colours4 <- rgb(clara4$medoids[clara4$clustering, ])
plot(y ~ x, data=rgbImage4, main="City",
col = colours4,
asp = 1, pch = ".")
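The expression `medoids[clustering, ]` used above is compact, so here is a minimal sketch of what it does. The two medoids and the four cluster labels are made up for this example and do not come from the photos.

```r
# Hypothetical medoids: one row of (r, g, b) per cluster
medoids <- rbind(c(0.2, 0.4, 0.6),
                 c(1.0, 0.8, 0.0))
clustering <- c(1, 2, 2, 1)          # hypothetical cluster label for each of 4 pixels

pixel_rgb <- medoids[clustering, ]   # 4 x 3 matrix: each pixel gets its medoid's colour
pixel_hex <- rgb(pixel_rgb)          # rgb() accepts a three-column matrix
pixel_hex
```

Indexing the medoid matrix by the label vector repeats each medoid row once per pixel in that cluster. This is how the recoloured image ends up with only k distinct colours.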
#3.3 Finding dominant colour
In the last step, we use the clusters to find the dominant colours in each image. First, we create a data frame with the colours selected during clustering, their frequency, and a label with the name of the image.
We then compute the percentage colour distribution. The frequency of a colour is simply the size of its cluster.
The distribution is presented as a bar chart:
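A minimal sketch of the counting step in base R is shown below. The vector `colours_demo` is a made-up stand-in for a per-pixel colour vector such as colours1, and the names `counts`, `shares` and `dominance` are introduced only for this example.

```r
# Hypothetical per-pixel colour vector (10 pixels, 3 clusters)
colours_demo <- c(rep("#140F09", 5), rep("#7690B5", 3), rep("#B8DCFC", 2))

counts <- table(colours_demo)                  # cluster sizes
shares <- round(100 * prop.table(counts), 1)   # percentage of pixels per colour

dominance <- data.frame(
  Colours   = names(counts),
  Frequency = as.vector(counts),
  Percent   = as.vector(shares)
)
dominance[order(-dominance$Frequency), ]       # most dominant colour first
```

The first row of the sorted data frame is the dominant colour of the image.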
dominant1 <- data.frame(Colours = colours1, Image = "Mountain")
dominant4 <- data.frame(Colours = colours4, Image = "City")
dominant <- rbind(dominant1, dominant4)
We load the ggplot2 package with library(ggplot2) and then use ggplot() to show the colour distribution:
library(ggplot2)
ggplot() + geom_bar(data = dominant,
aes(x = Image,
fill = Colours)) +
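# Naming each value by itself (setNames(nm = ...)) makes ggplot2 match colours by name,
# so every segment is filled with its own hex code whatever the order of this list
scale_fill_manual(values = setNames(nm = c( "#140F09", "#191919", "#20252B", "#30151C",
"#302521", "#4A4B50", "#7690B5", "#837C76",
"#847A78", "#A7C6F2", "#B0AD9C", "#B46E66",
"#B8DCFC", "#BDC3D3", "#BECDE0", "#2E3E6F",
"#3B3F3E", "#5D7660", "#5E798E", "#6C8645",
"#77A2A8", "#7EA392", "#855928")))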