1 Introduction

This paper analyzes the work of one of the most outstanding painters, Vincent Van Gogh. According to many opinions, his favourite colours were blue, yellow and green. Using unsupervised learning methods for image clustering, I will analyze some of his most popular paintings, find their dominant colours and check this hypothesis about his three favourite colours. I will compare five of Van Gogh’s paintings:
- ‘The Starry Night’,
- ‘Sunflowers’,
- ‘Blossoming Almond Tree’,
- ‘Irises’,
- ‘The Café Terrace on the Place du Forum’.

2 Dataset - images

The ‘jpeg’ library lets us import the pictures of the paintings that we want to analyze.

library(jpeg)
Image1 <- readJPEG(source = "thestarrynight.jpg")
Image2 <- readJPEG(source = "sunflowers.jpg")
Image3 <- readJPEG(source = "blossomingalmondtree.jpg")
Image4 <- readJPEG(source = "irises.jpg")
Image5 <- readJPEG(source = "thecafe.jpg")

Now we can represent our images in the RGB format, which is the most convenient one in this situation. As a brief introduction, the RGB colour model is an additive colour model in which three colours (red, green and blue) are mixed in various proportions to reproduce a broad array of colours. Its main purpose is the sensing, representation and display of images, typically in electronic systems and photography. The readJPEG function returns each image as an array of height × width × channels with intensities between 0 and 1. First, we check the dimensions of each image.
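
As a small illustration of additive mixing, the sketch below uses base R's rgb() function, which takes channel intensities between 0 and 1 and returns a hexadecimal colour code. The sample values are made up for illustration and do not come from the paintings.

```r
# Channel intensities are given on a 0-1 scale, as in the arrays returned by readJPEG
pure_red <- rgb(1, 0, 0)
yellow   <- rgb(1, 1, 0)   # red and green light mix to yellow
white    <- rgb(1, 1, 1)   # all three channels at full intensity
black    <- rgb(0, 0, 0)   # no light at all

print(c(red = pure_red, yellow = yellow, white = white, black = black))
```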

dm1 <- dim(Image1)
dm2 <- dim(Image2)
dm3 <- dim(Image3)
dm4 <- dim(Image4)
dm5 <- dim(Image5)

Printing the dimensions (height, width and number of colour channels) gives:

## [1] 599 756   3
## [1] 1281 1024    3
## [1] 2480 3139    3
## [1] 1467 1919    3
## [1] 1345 1024    3

Next, we create a data frame for each image that holds the pixel coordinates and the corresponding RGB values:

rgbImage1<-data.frame(x=rep(1:dm1[2], each=dm1[1]),  y=rep(dm1[1]:1, dm1[2]), 
                      r.value=as.vector(Image1[,,1]),  g.value=as.vector(Image1[,,2]), 
                      b.value=as.vector(Image1[,,3]))
rgbImage2<-data.frame(x=rep(1:dm2[2], each=dm2[1]),  y=rep(dm2[1]:1, dm2[2]), 
                      r.value=as.vector(Image2[,,1]),  g.value=as.vector(Image2[,,2]), 
                      b.value=as.vector(Image2[,,3]))
rgbImage3<-data.frame(x=rep(1:dm3[2], each=dm3[1]),  y=rep(dm3[1]:1, dm3[2]), 
                      r.value=as.vector(Image3[,,1]),  g.value=as.vector(Image3[,,2]), 
                      b.value=as.vector(Image3[,,3]))
rgbImage4<-data.frame(x=rep(1:dm4[2], each=dm4[1]),  y=rep(dm4[1]:1, dm4[2]), 
                      r.value=as.vector(Image4[,,1]),  g.value=as.vector(Image4[,,2]), 
                      b.value=as.vector(Image4[,,3]))
rgbImage5<-data.frame(x=rep(1:dm5[2], each=dm5[1]),  y=rep(dm5[1]:1, dm5[2]), 
                      r.value=as.vector(Image5[,,1]),  g.value=as.vector(Image5[,,2]), 
                      b.value=as.vector(Image5[,,3]))
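
The same construction is repeated for every image, so it could be wrapped in a function. The sketch below is one possible way to do that; image_to_df is a hypothetical helper name, and the tiny array stands in for a real painting so that the example runs without the JPEG files.

```r
# Hypothetical helper: turn a height x width x 3 array into a long data frame
image_to_df <- function(img) {
  dm <- dim(img)
  data.frame(
    x = rep(1:dm[2], each = dm[1]),   # column index, repeated for every row
    y = rep(dm[1]:1, dm[2]),          # row index, flipped so that row 1 is drawn at the top
    r.value = as.vector(img[, , 1]),
    g.value = as.vector(img[, , 2]),
    b.value = as.vector(img[, , 3])
  )
}

# Toy 2 x 3 "image" with made-up channel values between 0 and 1
toy <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))
toy_df <- image_to_df(toy)
print(toy_df)
```

With such a helper, rgbImage1 <- image_to_df(Image1) should give the same data frame as the explicit code above.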

Finally, we can plot all five pictures pixel by pixel in colours from RGB:

plot(y~x, data=rgbImage1, main="The Starry Night", col=rgb(rgbImage1[c("r.value", "g.value", "b.value")]), 
     asp=1, pch=".")

plot(y~x, data=rgbImage2, main="Sunflowers", col=rgb(rgbImage2[c("r.value", "g.value", "b.value")]), 
     asp=1, pch=".")

plot(y~x, data=rgbImage3, main="Blossoming Almond Tree", 
     col=rgb(rgbImage3[c("r.value", "g.value", "b.value")]), asp=1, pch=".")

plot(y~x, data=rgbImage4, main="Irises", col=rgb(rgbImage4[c("r.value", "g.value", "b.value")]), 
     asp=1, pch=".")

plot(y~x, data=rgbImage5, main="The Café Terrace on the Place du Forum", 
     col=rgb(rgbImage5[c("r.value", "g.value", "b.value")]), asp=1, pch=".")

3 Clustering

3.1 Optimal number of k-clusters

Each of the five data sets now has five columns and between roughly 450 thousand and 7.8 million observations (one per pixel). Because the data sets are this large, I will use the CLARA algorithm from the ‘cluster’ library, which is designed for large data sets. It is based on the k-medoids PAM algorithm, which it applies to samples of the data rather than to all observations. To find the optimal number of clusters k for each image, we compare the average silhouette width across candidate values of k. The silhouette width describes clustering consistency and ranges from -1 to 1; it is only defined for two or more clusters. We should choose the number of clusters with the highest average silhouette width: the higher the value, the better the clustering, so we are looking for clearly positive values.
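
To make the silhouette width concrete: for a point i, let a(i) be its mean distance to the other points of its own cluster and b(i) its mean distance to the points of the nearest other cluster; then s(i) = (b(i) - a(i)) / max(a(i), b(i)). The sketch below computes this by hand for one point of a made-up one-dimensional data set and compares it with silhouette() from the ‘cluster’ library. The data and variable names are illustrative assumptions.

```r
library(cluster)

# Made-up data: two well-separated groups on a line
x  <- c(0, 1, 10, 11)
cl <- c(1L, 1L, 2L, 2L)
d  <- as.matrix(dist(x))

# Silhouette of the first point, computed by hand
a1 <- mean(d[1, cl == 1 & seq_along(x) != 1])   # mean distance within its own cluster: 1
b1 <- mean(d[1, cl == 2])                       # mean distance to the other cluster: 10.5
s1 <- (b1 - a1) / max(a1, b1)

# The same value as reported by the 'cluster' library
sil <- silhouette(cl, dist(x))
print(c(by_hand = s1, from_cluster = as.numeric(sil[1, "sil_width"])))
```
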
First picture:

library(cluster)
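# Average silhouette width for k = 2..10 (it is undefined for a single cluster)
n1 <- rep(NA_real_, 10)
for (i in 2:10) {
  cl <- clara(rgbImage1[, c("r.value", "g.value", "b.value")], i)
  n1[i] <- cl$silinfo$avg.width
}

# n1[1] stays NA, so the position on the x-axis equals the number of clusters
plot(n1, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", 
     col="blue")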


The graph shows that 2 clusters are optimal for ‘The Starry Night’.
Second picture:

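# Average silhouette width for k = 2..10 (it is undefined for a single cluster)
n2 <- rep(NA_real_, 10)
for (i in 2:10) {
  cl <- clara(rgbImage2[, c("r.value", "g.value", "b.value")], i)
  n2[i] <- cl$silinfo$avg.width
}

# n2[1] stays NA, so the position on the x-axis equals the number of clusters
plot(n2, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", 
     col="blue")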


The graph shows that 2 clusters are optimal for ‘Sunflowers’. However, because this picture also contains some darker and red elements, I decided to use 3 clusters.
Third picture:

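# Average silhouette width for k = 2..10 (it is undefined for a single cluster)
n3 <- rep(NA_real_, 10)
for (i in 2:10) {
  cl <- clara(rgbImage3[, c("r.value", "g.value", "b.value")], i)
  n3[i] <- cl$silinfo$avg.width
}

# n3[1] stays NA, so the position on the x-axis equals the number of clusters
plot(n3, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", 
     col="blue")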


The graph shows that 3 clusters are optimal for ‘Blossoming Almond Tree’.
Fourth picture:

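# Average silhouette width for k = 2..10 (it is undefined for a single cluster)
n4 <- rep(NA_real_, 10)
for (i in 2:10) {
  cl <- clara(rgbImage4[, c("r.value", "g.value", "b.value")], i)
  n4[i] <- cl$silinfo$avg.width
}

# n4[1] stays NA, so the position on the x-axis equals the number of clusters
plot(n4, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", 
     col="blue")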


The graph shows that 3 clusters are optimal for ‘Irises’.
Fifth picture:

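# Average silhouette width for k = 2..10 (it is undefined for a single cluster)
n5 <- rep(NA_real_, 10)
for (i in 2:10) {
  cl <- clara(rgbImage5[, c("r.value", "g.value", "b.value")], i)
  n5[i] <- cl$silinfo$avg.width
}

# n5[1] stays NA, so the position on the x-axis equals the number of clusters
plot(n5, type='l', main="Optimal number of clusters", xlab="Number of clusters", ylab="Average silhouette", 
     col="blue")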


The graph shows that 2 clusters are optimal for ‘The Café Terrace on the Place du Forum’. However, because the image is complex and uses many colours and shades, I decided to use 3 clusters in the further analysis.
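
The search for the best k is the same for every image, so it could be expressed once as a function. The following is a minimal sketch under stated assumptions: avg_silhouette is a hypothetical helper name, and the toy "pixels" are random made-up RGB values in two well-separated groups rather than data from the paintings.

```r
library(cluster)

# Hypothetical helper: average silhouette width of a CLARA fit for each candidate k
avg_silhouette <- function(rgb_values, k_values = 2:10) {
  widths <- vapply(k_values,
                   function(k) clara(rgb_values, k)$silinfo$avg.width,
                   numeric(1))
  setNames(widths, k_values)
}

# Toy "pixels": a dark group and a bright group of made-up RGB values
set.seed(1)
toy_pixels <- rbind(matrix(runif(300, 0.0, 0.2), ncol = 3),
                    matrix(runif(300, 0.8, 1.0), ncol = 3))
colnames(toy_pixels) <- c("r.value", "g.value", "b.value")

widths <- avg_silhouette(toy_pixels, k_values = 2:5)
best_k <- as.integer(names(which.max(widths)))   # k with the highest average silhouette
print(round(widths, 3))
print(best_k)
```

Applied to a painting, avg_silhouette(rgbImage1[, c("r.value", "g.value", "b.value")]) would return the values plotted in the silhouette graph, and which.max picks the k that the graph is read for by eye.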

3.2 CLARA algorithm

The next step of the analysis is to run the CLARA algorithm with the chosen number of clusters. After running it, I will generate the clustered images, colouring every pixel with the RGB values of the medoid of its cluster.
First picture:

thestarrynight <- rgbImage1[, c("r.value", "g.value", "b.value")]
clara1<-clara(thestarrynight, 2) 
plot(silhouette(clara1))

colours1<-rgb(clara1$medoids[clara1$clustering, ])
plot(y~x, data = rgbImage1, col=colours1, pch=".", asp=1, main="The Starry Night")


Second picture:

sunflowers <- rgbImage2[, c("r.value", "g.value", "b.value")]
clara2<-clara(sunflowers, 3) 
plot(silhouette(clara2))

colours2<-rgb(clara2$medoids[clara2$clustering, ])
plot(y~x, data = rgbImage2, col=colours2, pch=".", asp=1, main="Sunflowers")


Third picture:

blossomingalmondtree <- rgbImage3[, c("r.value", "g.value", "b.value")]
clara3<-clara(blossomingalmondtree, 3) 
plot(silhouette(clara3))

colours3<-rgb(clara3$medoids[clara3$clustering, ])
plot(y~x, data = rgbImage3, col=colours3, pch=".", asp=1, main="Blossoming Almond Tree")


Fourth picture:

irises <- rgbImage4[, c("r.value", "g.value", "b.value")]
clara4<-clara(irises, 3) 
plot(silhouette(clara4))

colours4<-rgb(clara4$medoids[clara4$clustering, ])
plot(y~x, data = rgbImage4, col=colours4, pch=".", asp=1, main="Irises")


Fifth picture:

thecafe <- rgbImage5[, c("r.value", "g.value", "b.value")]
clara5<-clara(thecafe, 3) 
plot(silhouette(clara5))

colours5<-rgb(clara5$medoids[clara5$clustering, ])
plot(y~x, data = rgbImage5, col=colours5, pch=".", asp=1, main="The Café Terrace on the Place du Forum")
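
The colouring step used for each picture, rgb(clara$medoids[clara$clustering, ]), can be illustrated on a toy example. In the sketch below the medoids and the cluster assignment are made-up values, not results from the paintings.

```r
# Made-up medoids: one row per cluster, columns are red, green and blue on a 0-1 scale
toy_medoids <- rbind(c(0.2, 0.4, 0.8),   # a blue
                     c(1.0, 0.8, 0.2))   # a yellow

# Made-up cluster assignment of five pixels
toy_clustering <- c(1, 2, 2, 1, 2)

# Indexing the medoid matrix by the assignment vector repeats each medoid row once per pixel;
# rgb() then converts every row to a hexadecimal colour code
toy_colours <- rgb(toy_medoids[toy_clustering, ])
print(toy_colours)
```

Every pixel therefore receives the colour of its cluster's medoid, which is why the clustered images contain only as many colours as there are clusters.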

3.3 Colours analysis

Last but not least, unsupervised learning methods enable us to find the dominant colours of each image from its clusters. First, we create a data frame with one row per pixel, holding the colour selected for that pixel during clustering and the name of the image. The frequency of each colour is then the number of rows that carry it.

# One row per pixel: the cluster colour of the pixel and the name of its image
dominant1 <- data.frame(Colours = colours1, Image = "The Starry Night")
dominant2 <- data.frame(Colours = colours2, Image = "Sunflowers")
dominant3 <- data.frame(Colours = colours3, Image = "Blossoming Almond Tree")
dominant4 <- data.frame(Colours = colours4, Image = "Irises")
dominant5 <- data.frame(Colours = colours5, Image = "The Café Terrace on the Place du Forum")
dominant <- rbind(dominant1, dominant2, dominant3, dominant4, dominant5)
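
Before plotting, the share of each cluster colour within an image can also be tabulated directly. The sketch below shows one way to do this on a made-up data frame with the same two columns; the hex codes, image names and the variable names toy_dominant and fill_palette are illustrative assumptions.

```r
# Made-up per-pixel colours for two toy "images" (hex codes are illustrative only)
toy_dominant <- data.frame(
  Colours = c("#3366CC", "#3366CC", "#FFCC33", "#FFCC33", "#FFCC33",
              "#3366CC", "#669933", "#669933"),
  Image   = c(rep("Toy A", 5), rep("Toy B", 3)),
  stringsAsFactors = FALSE
)

# Rows: images, columns: colours, cells: share of the image's pixels with that colour
shares <- prop.table(table(toy_dominant$Image, toy_dominant$Colours), margin = 1)
print(round(shares, 2))

# A palette derived from the data instead of typed by hand; naming each entry by its
# own hex code means it could be passed to scale_fill_manual(values = fill_palette)
fill_palette <- sort(unique(toy_dominant$Colours))
fill_palette <- setNames(fill_palette, fill_palette)
print(fill_palette)
```

Deriving the palette from the Colours column keeps the fill colours in step with the clustering result if the number of clusters changes.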

Now we can use the ‘ggplot2’ library to present the distribution of colours in each image as a bar plot:

library(ggplot2)
ggplot() + geom_bar(data = dominant, 
                    aes(x  = Image,
                        fill = Colours)) +
  scale_fill_manual(values = c("#283231","#2E3E6F", "#3B3F3E", "#5D7660", "#5E798E", 
                               "#6C8645", "#77A2A8", "#7EA392", "#855928", 
                               "#B2BDAC", "#B2BEA6", "#C79E35", "#CAA058"))


As we can see, Vincent Van Gogh used similar colours across these paintings and relied on different shades of blue, green and yellow, so the hypothesis from the introduction is confirmed for the five paintings analyzed. The most frequent colour is green and its variants.
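
Reading colour families off a bar plot is a visual judgement. One possible way to make it more explicit is to convert each hex code to a hue angle and assign it to a family, as sketched below. The helper name hue_family, the hue boundaries and the saturation cut-off are my own assumptions for illustration, not part of the analysis above; the four hex codes are taken from the palette in the plot.

```r
# Hypothetical helper: assign hex colours to rough colour families by hue.
# The hue boundaries and the saturation cut-off are illustrative assumptions.
hue_family <- function(hex, min_saturation = 0.15) {
  hsv_values <- rgb2hsv(col2rgb(hex))   # rows "h", "s", "v", each on a 0-1 scale
  hue <- hsv_values["h", ] * 360        # hue as an angle in degrees
  family <- as.character(cut(hue,
                             breaks = c(0, 30, 75, 165, 270, 360),
                             labels = c("red", "yellow", "green", "blue", "purple or red"),
                             include.lowest = TRUE))
  # Nearly grey colours have no meaningful hue
  family[hsv_values["s", ] < min_saturation] <- "neutral"
  setNames(family, hex)
}

families <- hue_family(c("#2E3E6F", "#C79E35", "#6C8645", "#3B3F3E"))
print(families)
```

Under these assumed boundaries the four sample colours fall into the blue, yellow, green and neutral families respectively; different boundaries would shift borderline shades.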

4 Conclusion

This paper shows the possibilities of cluster analysis: we can analyze not only numerical data but also issues that are more difficult to capture, such as the colours in pictures.