###Image Clustering
##1 Introduction
This paper shows how unsupervised learning methods can be used for image clustering. People on social media like to post their pictures, and they sometimes like to blur and simplify them. The idea is to cluster the pixels of an image and find its dominant colors, so that the picture can be represented with fewer colors. I compare five of my own photos: “Mountain”, “Cat”, “Beach”, “City” and “Street”.
##2 Data preprocessing
The first step is to open the images. Reading .jpg files requires the jpeg package.
library(jpeg)
# "Mountain", "Cat", "Beach", "City" and "Street"
image1 <- readJPEG("IMG_4982.JPG")
image2 <- readJPEG("IMG_0876.JPG")
image3 <- readJPEG("IMG_7479.JPG")
image4 <- readJPEG("IMG_1721.JPG")
image5 <- readJPEG("IMG_6934.JPG")
Second, the dim() function returns the size of each image (rows, columns and color channels).
dm1 <- dim(image1);dm1[1:2]
## [1] 3088 2316
dm2 <- dim(image2);dm2[1:2]
## [1] 1108 1478
dm3 <- dim(image3);dm3[1:2]
## [1] 3088 2316
dm4 <- dim(image4);dm4[1:2]
## [1] 2944 2208
dm5 <- dim(image5);dm5[1:2]
## [1] 3840 2880
Size of the “Mountain” image: 3088 x 2316.
Size of the “Cat” image: 1108 x 1478.
Size of the “Beach” image: 3088 x 2316.
Size of the “City” image: 2944 x 2208.
Size of the “Street” image: 3840 x 2880.
Third, we convert each image array into a data frame that stores the x and y coordinates of every pixel together with its red, green and blue values.
rgbImage1 <- data.frame(
x=rep(1:dm1[2], each=dm1[1]),
y=rep(dm1[1]:1, dm1[2]),
r.value=as.vector(image1[,,1]),
g.value=as.vector(image1[,,2]),
b.value=as.vector(image1[,,3]))
rgbImage2 <- data.frame(
x=rep(1:dm2[2], each=dm2[1]),
y=rep(dm2[1]:1, dm2[2]),
r.value=as.vector(image2[,,1]),
g.value=as.vector(image2[,,2]),
b.value=as.vector(image2[,,3]))
rgbImage3 <- data.frame(
x=rep(1:dm3[2], each=dm3[1]),
y=rep(dm3[1]:1, dm3[2]),
r.value=as.vector(image3[,,1]),
g.value=as.vector(image3[,,2]),
b.value=as.vector(image3[,,3]))
rgbImage4 <- data.frame(
x=rep(1:dm4[2], each=dm4[1]),
y=rep(dm4[1]:1, dm4[2]),
r.value=as.vector(image4[,,1]),
g.value=as.vector(image4[,,2]),
b.value=as.vector(image4[,,3]))
rgbImage5 <- data.frame(
x=rep(1:dm5[2], each=dm5[1]),
y=rep(dm5[1]:1, dm5[2]),
r.value=as.vector(image5[,,1]),
g.value=as.vector(image5[,,2]),
b.value=as.vector(image5[,,3]))
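The five blocks above repeat the same conversion, so it could be wrapped in a function. The following is a minimal sketch, assuming the input is an H x W x 3 array like the one readJPEG returns; `image_to_df` is a hypothetical helper name that does not appear in the original analysis, and the tiny synthetic array only stands in for a real photo.

```r
# Hypothetical helper (not part of the original analysis): converts an
# H x W x 3 array, as returned by jpeg::readJPEG, into a long data frame.
image_to_df <- function(img) {
  h <- dim(img)[1]  # number of pixel rows (image height)
  w <- dim(img)[2]  # number of pixel columns (image width)
  data.frame(
    x = rep(seq_len(w), each = h),        # column index, repeated for each row
    y = rep(rev(seq_len(h)), times = w),  # row index flipped so row 1 plots at the top
    r.value = as.vector(img[, , 1]),
    g.value = as.vector(img[, , 2]),
    b.value = as.vector(img[, , 3])
  )
}

# Tiny synthetic 2 x 3 "image" so the sketch runs without any JPEG file
toy <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))
toy_df <- image_to_df(toy)
toy_df
```

With such a helper, each of the five data frames would be one call, for example `image_to_df(image1)`.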
As the last preprocessing step, we plot the original images from these data frames.
plot(y ~ x, data=rgbImage1, main="Mountain",
col = rgb(rgbImage1[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
plot(y ~ x, data=rgbImage2, main="Cat",
col = rgb(rgbImage2[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
plot(y ~ x, data=rgbImage3, main="Beach",
col = rgb(rgbImage3[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
plot(y ~ x, data=rgbImage4, main="City",
col = rgb(rgbImage4[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
plot(y ~ x, data=rgbImage5, main="Street",
col = rgb(rgbImage5[c("r.value", "g.value", "b.value")]),
asp = 1, pch = ".")
##3 Clustering
#3.1 Optimal number of k-clusters
First of all, we need to load the cluster package with library(cluster). The main problem is then to decide on the number of clusters.
The data sets are very large (even the smallest image has 1108 x 1478 pixels with 3 color channels each), so we use the Clara algorithm for image clustering. It is based on the k-medoids PAM algorithm, which it applies to sub-samples of the data, and is therefore suited to large data sets.
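To make the idea concrete, here is a small sketch of clara on synthetic data instead of a real photo. The two color groups and the values chosen for `samples` and `sampsize` are illustrative assumptions, not settings used in this paper.

```r
library(cluster)

set.seed(42)  # makes the synthetic pixel values below reproducible

# Synthetic "pixels": two well-separated color groups (dark and bright)
pixels <- rbind(
  matrix(runif(3000, 0.0, 0.2), ncol = 3),
  matrix(runif(3000, 0.8, 1.0), ncol = 3)
)
colnames(pixels) <- c("r.value", "g.value", "b.value")

# clara runs PAM on `samples` sub-samples of `sampsize` rows each and keeps
# the best set of medoids; the values used here are only illustrative
fit <- clara(pixels, k = 2, samples = 10, sampsize = 200)

fit$medoids            # one representative (actual) pixel per cluster
table(fit$clustering)  # cluster sizes over all 2000 pixels
```

Because every medoid is an actual row of the data, each cluster is represented by a color that really occurs in the image.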
We find the optimal number of clusters k for each image by comparing the average silhouette width for every k. The silhouette ranges from -1 to 1. Typically, a positive value means that the elements of a cluster are matched correctly, and the higher the value, the better the clustering.
Our goal is to reduce the number of colors in the images, so ideally we choose the k with the highest average silhouette width. When a k between 6 and 10 scores high, I pick it, because I prefer to keep more colors. When the highest value falls in the range 2 to 5, that is also fine and we do not need to cluster again.
We use the cluster library to run the clara algorithm for every k from 2 to 10 and record the average silhouette width.
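As a worked illustration of the silhouette, the sketch below computes it for one point by hand and compares the result with silhouette() from the cluster package. The six one-dimensional points and their labels are made up for the example.

```r
library(cluster)

# Six one-dimensional points forming two obvious groups
x <- c(1, 2, 3, 11, 12, 13)
labels <- rep(1:2, each = 3)

sil <- silhouette(labels, dist(x))

# Silhouette of the first point by hand:
# a = mean distance to the other points of its own cluster
# b = mean distance to the points of the nearest other cluster
a <- mean(abs(x[1] - x[2:3]))  # (1 + 2) / 2 = 1.5
b <- mean(abs(x[1] - x[4:6]))  # (10 + 11 + 12) / 3 = 11
s1 <- (b - a) / max(a, b)      # (11 - 1.5) / 11

avg_width <- mean(sil[, "sil_width"])  # average silhouette width
```

The value `s1` is close to 1 because the point lies much nearer to its own group than to the other one.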
library(cluster)
#Mountain 9
n1 <- c()
for (i in 2:10) {
cl1 <- clara(rgbImage1, i)
print(paste('Average silhouette for',i,'clusters:',cl1$silinfo$avg.width))
n1[i] <- cl1$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.450381292035369"
## [1] "Average silhouette for 3 clusters: 0.41343672659383"
## [1] "Average silhouette for 4 clusters: 0.410918153117001"
## [1] "Average silhouette for 5 clusters: 0.450526971525648"
## [1] "Average silhouette for 6 clusters: 0.410930930328657"
## [1] "Average silhouette for 7 clusters: 0.395629548763213"
## [1] "Average silhouette for 8 clusters: 0.437905447729368"
## [1] "Average silhouette for 9 clusters: 0.453412244662813"
## [1] "Average silhouette for 10 clusters: 0.4366691587312"
plot(n1, type = 'l',
main = "Mountain",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
#Cat 9
n2 <- c()
for (i in 2:10) {
cl2 <- clara(rgbImage2, i)
print(paste('Average silhouette for',i,'clusters:',cl2$silinfo$avg.width))
n2[i] <- cl2$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.415471244276469"
## [1] "Average silhouette for 3 clusters: 0.393647714951608"
## [1] "Average silhouette for 4 clusters: 0.484598533091227"
## [1] "Average silhouette for 5 clusters: 0.480042694993103"
## [1] "Average silhouette for 6 clusters: 0.399557260759309"
## [1] "Average silhouette for 7 clusters: 0.414550785466934"
## [1] "Average silhouette for 8 clusters: 0.462839621991052"
## [1] "Average silhouette for 9 clusters: 0.438008305088433"
## [1] "Average silhouette for 10 clusters: 0.443680487596096"
plot(n2, type = 'l',
main = "Cat",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
#Beach 9
n3 <- c()
for (i in 2:10) {
cl3 <- clara(rgbImage3, i)
print(paste('Average silhouette for',i,'clusters:',cl3$silinfo$avg.width))
n3[i] <- cl3$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.450381289923871"
## [1] "Average silhouette for 3 clusters: 0.413436735468241"
## [1] "Average silhouette for 4 clusters: 0.410918136151174"
## [1] "Average silhouette for 5 clusters: 0.450526923095636"
## [1] "Average silhouette for 6 clusters: 0.410930777306704"
## [1] "Average silhouette for 7 clusters: 0.395629192383147"
## [1] "Average silhouette for 8 clusters: 0.437905109973589"
## [1] "Average silhouette for 9 clusters: 0.453411985141977"
## [1] "Average silhouette for 10 clusters: 0.436668691641044"
plot(n3, type = 'l',
main = "Beach",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
#City 6
n4 <- c()
for (i in 2:10) {
cl4 <- clara(rgbImage4, i)
print(paste('Average silhouette for',i,'clusters:',cl4$silinfo$avg.width))
n4[i] <- cl4$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.437445923816954"
## [1] "Average silhouette for 3 clusters: 0.430596750321306"
## [1] "Average silhouette for 4 clusters: 0.521382325059271"
## [1] "Average silhouette for 5 clusters: 0.490634403200633"
## [1] "Average silhouette for 6 clusters: 0.481469631961961"
## [1] "Average silhouette for 7 clusters: 0.441929146142101"
## [1] "Average silhouette for 8 clusters: 0.505888419816954"
## [1] "Average silhouette for 9 clusters: 0.50992734468438"
## [1] "Average silhouette for 10 clusters: 0.484028369679495"
plot(n4, type = 'l',
main = "City",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
#Street 4
n5 <- c()
for (i in 2:10) {
cl5 <- clara(rgbImage5, i)
print(paste('Average silhouette for',i,'clusters:',cl5$silinfo$avg.width))
n5[i] <- cl5$silinfo$avg.width
}
## [1] "Average silhouette for 2 clusters: 0.423491472075399"
## [1] "Average silhouette for 3 clusters: 0.389079789027534"
## [1] "Average silhouette for 4 clusters: 0.444918190990254"
## [1] "Average silhouette for 5 clusters: 0.435595094720001"
## [1] "Average silhouette for 6 clusters: 0.383054153654223"
## [1] "Average silhouette for 7 clusters: 0.420505573077479"
## [1] "Average silhouette for 8 clusters: 0.386852172440343"
## [1] "Average silhouette for 9 clusters: 0.437575966202725"
## [1] "Average silhouette for 10 clusters: 0.419061757009078"
plot(n5, type = 'l',
main = "Street",
xlab = "Number of clusters",
ylab = "Average silhouette",
col = "blue")
Chosen number of clusters: Mountain: 9, Cat: 4, Beach: 9, City: 6, Street: 4. Each pixel has 3 basic colors (red, green and blue), and the average silhouette width stays between roughly 0.4 and 0.5 for every k. I therefore choose 9, 6 and 4 clusters for the images, in order to make the color analysis more diverse.
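The loops above pass the whole data frame to clara, so the x and y coordinates enter the distances together with the color values. A variant that scores k on the color channels only might look like the sketch below. `avg_silhouette` is a hypothetical helper, and the synthetic data frame with three color groups only stands in for a real image.

```r
library(cluster)

# Hypothetical helper: average silhouette width of clara() for each k,
# computed on the color columns only
avg_silhouette <- function(df, ks = 2:10) {
  colours <- df[, c("r.value", "g.value", "b.value")]
  widths <- sapply(ks, function(k) clara(colours, k)$silinfo$avg.width)
  names(widths) <- ks
  widths
}

set.seed(1)
# Synthetic stand-in for an image: three tight color groups of 300 pixels each
centres <- rbind(c(0.1, 0.1, 0.1), c(0.5, 0.8, 0.2), c(0.9, 0.9, 0.9))
toy <- as.data.frame(centres[rep(1:3, each = 300), ] +
                       matrix(rnorm(2700, sd = 0.01), ncol = 3))
names(toy) <- c("r.value", "g.value", "b.value")
toy$x <- seq_len(nrow(toy))  # position columns are present but ignored
toy$y <- 1

widths <- avg_silhouette(toy, ks = 2:5)
best_k <- as.integer(names(widths)[which.max(widths)])
```

Here `which.max` returns the k with the highest average silhouette width. A preference for more colors, as described above, would still have to be applied by hand.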
#3.2 Running Clara algorithm
This step runs the Clara algorithm with the chosen number of clusters. We then plot the clustered image using the rgb() function, which creates colors from RGB values.
“Mountain” image:
mountain = rgbImage1[, c("r.value", "g.value", "b.value")]
clara1 <- clara(mountain, 9)
plot(silhouette(clara1))
colours1 <- rgb(clara1$medoids[clara1$clustering, ])
plot(y ~ x, data=rgbImage1, main="Mountain",
col = colours1,
asp = 1, pch = ".")
“Cat” image:
cat = rgbImage2[, c("r.value", "g.value", "b.value")]
clara2 <- clara(cat, 4)
plot(silhouette(clara2))
colours2 <- rgb(clara2$medoids[clara2$clustering, ])
plot(y ~ x, data=rgbImage2, main="Cat",
col = colours2,
asp = 1, pch = ".")
“Beach” image:
beach = rgbImage3[, c("r.value", "g.value", "b.value")]
clara3 <- clara(beach, 9)
plot(silhouette(clara3))
colours3 <- rgb(clara3$medoids[clara3$clustering, ])
plot(y ~ x, data=rgbImage3, main="Beach",
col = colours3,
asp = 1, pch = ".")
“City” image:
city = rgbImage4[, c("r.value", "g.value", "b.value")]
clara4 <- clara(city, 6)
plot(silhouette(clara4))
colours4 <- rgb(clara4$medoids[clara4$clustering, ])
plot(y ~ x, data=rgbImage4, main="City",
col = colours4,
asp = 1, pch = ".")
“Street” image:
street = rgbImage5[, c("r.value", "g.value", "b.value")]
clara5 <- clara(street, 4)
plot(silhouette(clara5))
colours5 <- rgb(clara5$medoids[clara5$clustering, ])
plot(y ~ x, data=rgbImage5, main="Street",
col = colours5,
asp = 1, pch = ".")
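The key step in each block above is the indexing `medoids[clustering, ]`, which looks up the medoid color of every pixel. The sketch below isolates that step. `quantise_colours` is a hypothetical helper name, and the synthetic data frame replaces a real image.

```r
library(cluster)

# Hypothetical helper: replace every pixel color by the medoid of its cluster
quantise_colours <- function(df, k) {
  colours <- df[, c("r.value", "g.value", "b.value")]
  fit <- clara(colours, k)
  # fit$clustering holds one cluster index per pixel; using it as a row index
  # into the k x 3 medoid matrix yields one medoid color per pixel
  rgb(fit$medoids[fit$clustering, , drop = FALSE])
}

set.seed(7)
# Synthetic pixels: 200 dark ones followed by 200 bright ones
toy <- data.frame(
  r.value = c(runif(200, 0.0, 0.1), runif(200, 0.9, 1.0)),
  g.value = c(runif(200, 0.0, 0.1), runif(200, 0.9, 1.0)),
  b.value = c(runif(200, 0.0, 0.1), runif(200, 0.9, 1.0))
)
hex <- quantise_colours(toy, k = 2)
length(unique(hex))  # number of distinct colors left after quantisation
```

The returned vector has one hex code per pixel, so it can be passed directly as the `col` argument of plot().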
#3.3 Finding dominant colour
Finally, we use the clusters to find the dominant colors in each image. First, we create a data frame that holds, for every pixel, the hex code of the color selected during clustering and a label with the name of the image.
The frequency of each color is simply the size of its cluster, so counting the rows per color gives the color distribution of each image.
We present the distribution in a bar chart as below:
dominant1 <- setNames(cbind(as.data.frame(colours1), rep("Mountain", length(colours1))), c("Colours", "Image"))
dominant2 <- setNames(cbind(as.data.frame(colours2), rep("Cat", length(colours2))), c("Colours", "Image"))
dominant3 <- setNames(cbind(as.data.frame(colours3), rep("Beach", length(colours3))), c("Colours", "Image"))
dominant4 <- setNames(cbind(as.data.frame(colours4),rep("City", length(colours4))), c("Colours", "Image"))
dominant5 <- setNames(cbind(as.data.frame(colours5), rep("Street", length(colours5))), c("Colours", "Image"))
dominant <- rbind(dominant1, dominant2, dominant3, dominant4, dominant5)
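The percentage distribution can also be computed directly from such a data frame. The sketch below uses a toy stand-in with made-up rows, since the real `dominant` data frame has one row per pixel.

```r
# Toy stand-in for the `dominant` data frame built above (values are made up)
dominant_toy <- data.frame(
  Colours = c("#10100E", "#10100E", "#10100E", "#D8E4F2", "#F09B27", "#F09B27"),
  Image   = c("Mountain", "Mountain", "Mountain", "Mountain", "Cat", "Cat")
)

# Pixel counts per image and color, then row-wise shares in percent
counts <- table(dominant_toy$Image, dominant_toy$Colours)
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```

Each row of `shares` sums to 100, so the largest entry in a row marks the dominant color of that image.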
Next we load the ggplot2 package with library(ggplot2) and use ggplot to show the color distribution:
library(ggplot2)
ggplot() + geom_bar(data = dominant,
aes(x = Image,
fill = Colours)) +
scale_fill_manual(values = c( "#10100E", "#161A23", "#1F1715", "#23232B", "#27160F",
"#2B2321", "#36372C", "#37484F", "#514748", "#566464",
"#655466", "#6D86AE", "#765A45", "#798385", "#8397BA",
"#887B72", "#8D7362", "#97ACCB", "#AFB0B2", "#B07861",
"#B3C3DC", "#B4B9BD", "#BEAEA1", "#C3B29E", "#D3BA9B",
"#D8E4F2", "#DBA57F", "#EDE6DC", "#F09B27", "#FBFDFC",
"#C1AB94", "#DAD4BE", "#DAD4BE","#E5B091", "#FBFDFC"))
##4 Conclusion
This paper shows an application of cluster analysis: we can not only analyze numerical data but also create fun pictures that may inspire more people to create their own special photos.