Hue Knows Best: Classifying Paintings Through Color Clustering

Introduction

Art is often categorized by its stylistic elements, brushwork, and historical context, but can it be classified purely by color? This paper explores how unsupervised learning can be leveraged to analyze and cluster paintings based on their dominant color compositions. By extracting RGB values from works by Vincent van Gogh, Gustav Klimt, and Mark Rothko, this project employs clustering techniques to determine whether machine learning can correctly group paintings by artist based solely on color.

For this project, we assume that each painter has a distinct and consistent color palette. This assumption allows us to analyze whether color alone is sufficient to classify paintings correctly.

library(cluster)
library(factoextra)
library(imager)
library(jpeg)
library(rasterImage)
library(dplyr)
library(recolorize)

Data

For this project, I created a custom dataset. For each artist — Van Gogh, Klimt, and Rothko — I selected four paintings with similar color schemes, resulting in a total of 12 paintings.

Training set: 9 paintings (3 per artist) used to train the model.
Test set: 3 paintings (1 per artist) used to evaluate the model on unseen data.

The first step was to organize the paintings by artist and split them into training and test sets.

rothko_paintings <- list.files(pattern = "Rothko_.*\\.jpg", full.names = TRUE)
klimt_paintings <- list.files(pattern = "Klimt_.*\\.jpg", full.names = TRUE)
vangogh_paintings <- list.files(pattern = "VanGogh_.*\\.jpg", full.names = TRUE)

training_set <- c(rothko_paintings[1:3], klimt_paintings[1:3], vangogh_paintings[1:3])
testing_set <- c(rothko_paintings[4], klimt_paintings[4], vangogh_paintings[4])

Next, I created a list of training images and used the recolorize function to extract and plot their color palettes. This visualization provided a clear representation of the dominant colors in each artist’s work.

image_paths <- c("VanGogh_1.jpg", "VanGogh_2.jpg", "VanGogh_3.jpg", 
                 "Rothko_1.jpg", "Rothko_2.jpg", "Rothko_3.jpg", 
                 "Klimt_1.jpg", "Klimt_2.jpg", "Klimt_3.jpg")

par(mfrow = c(3, 3), mar = c(1, 1, 1, 1))  

for (img_idx in seq_along(image_paths)) {
  # Reduce each painting to 4 colors with k-means, then draw the palette
  palette <- recolorize(image_paths[img_idx], method = "kmeans", n = 4, plotting = FALSE)
  plotColorPalette(palette$centers)
}

As expected, each painter exhibits a unique color palette:

Van Gogh - Blue-green hues
Mark Rothko - Reds and oranges
Gustav Klimt - Yellow and brown tones

To better understand the dataset, I plotted one of the images and checked its dimensions.

image1 <- readJPEG("VanGogh_1.jpg")
plot(1, type="n", main="Van Gogh painting 1")
rasterImage(image1, 0.6, 0.6, 1.4, 1.4)

dim1 <- dim(image1)
dim1

## [1] 200 200   3

Each painting is 200×200 pixels with three color layers (RGB). Since we need numerical data for clustering, I extracted RGB values from each image and removed the source labels for easier processing.

# Function to extract RGB data from a list of image files - generated with the help of AI

get_rgb_data <- function(image_files) {
  do.call(rbind, lapply(image_files, function(file_name) {
    image <- readJPEG(file_name)  
    data.frame(
      r.value = as.vector(image[,,1]), 
      g.value = as.vector(image[,,2]), 
      b.value = as.vector(image[,,3]), 
      source = basename(file_name)     
    )
  }))
}

training_rgb <- get_rgb_data(training_set)
testing_rgb <- get_rgb_data(testing_set)

training_rgb_no_source <- training_rgb[, c("r.value", "g.value", "b.value")]
testing_rgb_no_source <- testing_rgb[, c("r.value", "g.value", "b.value")]
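To make the flattening step concrete, here is a minimal sketch on a made-up 2×2 image. The array `toy_image` and its values are invented for illustration and are not part of the dataset; the point is only to show how `as.vector()` turns each channel into one column with one row per pixel.

```r
# Toy 2 x 2 "image" with 3 channels; values are invented for illustration.
toy_image <- array(
  c(0.1, 0.2, 0.3, 0.4,   # red channel, stored column by column
    0.5, 0.6, 0.7, 0.8,   # green channel
    0.9, 1.0, 0.0, 0.5),  # blue channel
  dim = c(2, 2, 3)
)

# Same flattening as in get_rgb_data(): one row per pixel, one column per channel.
toy_pixels <- data.frame(
  r.value = as.vector(toy_image[, , 1]),
  g.value = as.vector(toy_image[, , 2]),
  b.value = as.vector(toy_image[, , 3])
)

# Rows = height * width, so a 200 x 200 painting gives 40,000 rows.
nrow(toy_pixels)

# The pixel at row 2, column 1 of the image ends up as row 2 of the data frame.
toy_pixels[2, ]
toy_image[2, 1, ]
```

Because every channel is flattened in the same column-by-column order, the three values in a row always belong to the same pixel. The pixel's position in the image is discarded, which is fine here since only color is used.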

Clustering with CLARA

CLARA (Clustering Large Applications) is a clustering algorithm designed for large datasets. While my dataset consists of only 12 paintings, each painting is 200×200 pixels, which gives 40,000 data points per painting and 360,000 pixel rows for the nine training paintings alone.

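CLARA is an extension of k-medoids that runs the algorithm on random sub-samples of the data, which keeps it practical at this scale. Unlike k-means, k-medoids selects actual data points as cluster centers, so every center is a color that really occurs in a painting.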
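The difference between a medoid and a k-means center can be seen in a small sketch. The data below are synthetic (two invented color groups drawn with `runif()`), not pixels from the paintings, and the object names `toy`, `toy_clara`, and `toy_kmeans` are made up for this example.

```r
library(cluster)

set.seed(1)
# Synthetic "pixels": two invented colour groups, not the paintings' data.
toy <- rbind(
  matrix(runif(300, min = 0.0, max = 0.3), ncol = 3),
  matrix(runif(300, min = 0.7, max = 1.0), ncol = 3)
)
colnames(toy) <- c("r.value", "g.value", "b.value")

toy_clara <- clara(toy, k = 2, metric = "euclidean", samples = 3)

# Each medoid is an observed row: i.med holds its row index in `toy`.
toy_clara$medoids
all(toy_clara$medoids == toy[toy_clara$i.med, ])

# A k-means centre is a column average and usually matches no observed row.
toy_kmeans <- kmeans(toy, centers = 2)
toy_kmeans$centers
```

One design point worth keeping in mind: by default, `clara()` draws sub-samples of size `min(n, 40 + 2 * k)`, so with `k = 3` each sub-sample holds 46 points. The `sampsize` argument can raise this if the medoids seem unstable.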

I applied CLARA to the training set, specifying 3 clusters (one for each artist). The algorithm assigned each pixel in the dataset to a cluster, and each painting was then assigned to its most frequent cluster.

clara_model <- clara(training_rgb_no_source, k = 3, metric = "euclidean", samples = 3)

training_rgb$cluster <- clara_model$clustering

image_clusters <- training_rgb %>%
  group_by(source) %>%
  summarize(cluster = names(which.max(table(cluster))))  

image_clusters

## # A tibble: 9 × 2
##   source        cluster
##   <chr>         <chr>  
## 1 Klimt_1.jpg   2      
## 2 Klimt_2.jpg   2      
## 3 Klimt_3.jpg   2      
## 4 Rothko_1.jpg  1      
## 5 Rothko_2.jpg  1      
## 6 Rothko_3.jpg  1      
## 7 VanGogh_1.jpg 3      
## 8 VanGogh_2.jpg 3      
## 9 VanGogh_3.jpg 3
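The majority vote reports only the winning cluster, not how decisive the vote was. A rough sketch of how the pixel shares per image could be inspected is shown below; the data frame `toy_votes` is invented for illustration and does not reproduce the real CLARA output.

```r
# Invented pixel-level assignments for two images; not the real CLARA output.
toy_votes <- data.frame(
  source  = rep(c("A.jpg", "B.jpg"), each = 10),
  cluster = c(rep(1, 7), rep(2, 3),   # A.jpg: 70% of pixels in cluster 1
              rep(2, 6), rep(3, 4))   # B.jpg: 60% of pixels in cluster 2
)

# Rows are images, columns are clusters, cells are the share of pixels.
vote_share <- prop.table(table(toy_votes$source, toy_votes$cluster), margin = 1)
vote_share

# The majority cluster per image, as in names(which.max(table(cluster))).
majority <- colnames(vote_share)[apply(vote_share, 1, which.max)]
majority
```

Applied to the real data, the same table would be built from `training_rgb$source` and `training_rgb$cluster`. A painting that wins its cluster with only a slim majority would be a sign that its palette overlaps with another artist's.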

We can plot the training paintings, each labeled with its assigned cluster.

par(mfrow = c(3, 3), mar = c(5,5,1,1))

for (i in 1:nrow(image_clusters)) {
  img <- readJPEG(image_clusters$source[i]) 
  plot(1, type = "n", main = paste("Cluster:", image_clusters$cluster[i]),
       xlab = "", ylab = "")
  rasterImage(img, 0.6, 0.6, 1.4, 1.4) 
}

When plotted, the results showed that:

Mark Rothko’s paintings were assigned to Cluster 1
Klimt’s paintings were assigned to Cluster 2
Van Gogh’s paintings were assigned to Cluster 3

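For this training set, CLARA grouped every painting with the other works by the same artist, using pixel colors alone. The cluster numbers themselves are arbitrary labels; what matters is that each artist's paintings share one.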

Test data

Next, I prepared the test data: one unseen painting per artist.

testing_images_paths <- c("VanGogh_4.jpg","Rothko_4.jpg", "Klimt_4.jpg")

par(mfrow = c(1, 3), mar = c(1, 1, 1, 1)) 

for (i in 1:length(testing_images_paths)) {
  img <- readJPEG(testing_images_paths[i]) 
  plot(1, type = "n", main = paste("Image:", testing_images_paths[i]),
       xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')  
  rasterImage(img, 0.6, 0.6, 1.4, 1.4) 
}

Again, we observe the same distinct color schemes:

Van Gogh - Blue-green
Mark Rothko - Red
Klimt - Yellow-brown

To simplify the classification process, I calculated the mean RGB value for each test painting.

testing_mean_rgb <- testing_rgb %>%
  group_by(source) %>%
  summarise(
    mean_r = mean(r.value),
    mean_g = mean(g.value),
    mean_b = mean(b.value)
  )
testing_mean_rgb

## # A tibble: 3 × 4
##   source        mean_r mean_g mean_b
##   <chr>          <dbl>  <dbl>  <dbl>
## 1 Klimt_4.jpg    0.709  0.460 0.211 
## 2 Rothko_4.jpg   0.692  0.176 0.0848
## 3 VanGogh_4.jpg  0.398  0.521 0.546

Prediction Method

I initially considered using the predict() function from the flexclust package. However, it requires converting the model to the kcca class, and that conversion did not work with my data. Instead, I implemented a custom classification method: compute the Euclidean distance between each test painting’s mean RGB values and each cluster medoid from CLARA, then assign the painting to the nearest one.

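# CLARA's cluster centers are medoids (actual pixels), one row per cluster
cluster_centroids <- clara_model$medoids

# Plain numeric matrix of the test means, one row per painting
test_means <- as.matrix(testing_mean_rgb[, c("mean_r", "mean_g", "mean_b")])

# generated with help of AI
# For each test painting, find the cluster whose medoid is nearest (Euclidean)
predicted_clusters <- sapply(seq_len(nrow(test_means)), function(i) {
  distances <- sapply(seq_len(nrow(cluster_centroids)), function(j) {
    sqrt(sum((test_means[i, ] - cluster_centroids[j, ])^2))
  })
  which.min(distances)
})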
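The same nearest-medoid rule can also be written without nested loops. The sketch below uses the test-set means from the table above, but the medoids in `toy_medoids` are hypothetical values invented for this example, because the real `clara_model$medoids` are not printed in this report. The assignments it produces therefore illustrate the mechanics only.

```r
# Test-set mean colours copied from the table above.
test_means <- rbind(
  Klimt_4   = c(0.709, 0.460, 0.211),
  Rothko_4  = c(0.692, 0.176, 0.0848),
  VanGogh_4 = c(0.398, 0.521, 0.546)
)

# HYPOTHETICAL medoids, invented for this sketch; the real ones are in
# clara_model$medoids and are not printed in this report.
toy_medoids <- rbind(
  c(0.75, 0.15, 0.10),  # reddish
  c(0.70, 0.50, 0.20),  # yellow-brown
  c(0.35, 0.50, 0.55)   # blue-green
)

# Euclidean distance from every test painting (rows) to every medoid (columns).
dist_matrix <- apply(toy_medoids, 1, function(medoid) {
  sqrt(rowSums(sweep(test_means, 2, medoid)^2))
})

# Index of the nearest medoid for each painting.
nearest <- apply(dist_matrix, 1, which.min)
nearest
```

Keeping the full `dist_matrix` rather than only the minimum makes it possible to see how close the runner-up cluster was for each painting.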

Let’s see what we got from the calculations:

testing_mean_rgb$predicted_cluster <- predicted_clusters
testing_mean_rgb

## # A tibble: 3 × 5
##   source        mean_r mean_g mean_b predicted_cluster
##   <chr>          <dbl>  <dbl>  <dbl>             <int>
## 1 Klimt_4.jpg    0.709  0.460 0.211                  2
## 2 Rothko_4.jpg   0.692  0.176 0.0848                 1
## 3 VanGogh_4.jpg  0.398  0.521 0.546                  3

Each test painting was assigned to the cluster of its own artist: Rothko to Cluster 1, Klimt to Cluster 2, and Van Gogh to Cluster 3.
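This check can also be done in code rather than by eye. The sketch below assumes the cluster-to-artist mapping observed on the training set (1 = Rothko, 2 = Klimt, 3 = Van Gogh) and that the artist name is the part of the file name before the underscore. The names `test_result` and `cluster_to_artist` are made up for the example.

```r
# Predictions copied from the table above.
test_result <- data.frame(
  source = c("Klimt_4.jpg", "Rothko_4.jpg", "VanGogh_4.jpg"),
  predicted_cluster = c(2L, 1L, 3L)
)

# Cluster-to-artist mapping read off the training results.
cluster_to_artist <- c("1" = "Rothko", "2" = "Klimt", "3" = "VanGogh")

# The artist is the part of the file name before the underscore.
test_result$true_artist <- sub("_.*$", "", test_result$source)
test_result$predicted_artist <- unname(
  cluster_to_artist[as.character(test_result$predicted_cluster)]
)

# Share of test paintings whose predicted artist matches the file name.
accuracy <- mean(test_result$true_artist == test_result$predicted_artist)
accuracy
```

Since cluster numbers are arbitrary, the mapping would need to be rebuilt from the training results whenever the clustering is rerun with different settings.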

Lastly, I combined the training and test datasets and plotted all paintings together to visualize the final clusters.

testing_clusters <- testing_mean_rgb %>%
  select(source, predicted_cluster) %>%
  rename(cluster = predicted_cluster)

image_clusters$cluster <- as.character(image_clusters$cluster)
testing_clusters$cluster <- as.character(testing_clusters$cluster)


all_image_clusters <- bind_rows(image_clusters, testing_clusters)


all_image_clusters <- all_image_clusters %>%
  arrange(cluster)

par(mfrow = c(3, 4), mar = c(5, 5, 1, 1))  

for (i in 1:nrow(all_image_clusters)) {
  img <- readJPEG(all_image_clusters$source[i]) 
  plot(1, type = "n", main = paste("Cluster:", all_image_clusters$cluster[i]),
       xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')
  rasterImage(img, 0.6, 0.6, 1.4, 1.4) 
}

Conclusion

This study explored whether paintings could be classified based solely on their color composition using unsupervised learning methods. By extracting RGB values from artworks by Vincent van Gogh, Gustav Klimt, and Mark Rothko, we applied the CLARA clustering algorithm to identify dominant color patterns. The results showed that the model successfully grouped paintings into clusters that corresponded to the original artists, suggesting that color alone can serve as a distinguishing feature in artistic classification.

However, it is important to note that this dataset was carefully curated: the paintings were chosen so that each artist is represented by a single dominant color palette. In reality, artists often experiment with diverse color schemes. For example, if we added Van Gogh’s “Sunflowers”, the model would likely misclassify it, as it is predominantly yellow rather than blue-green.