Art is often categorized by its stylistic elements, brushwork, and historical context, but can it be classified purely by color? This paper explores how unsupervised learning can be leveraged to analyze and cluster paintings based on their dominant color compositions. By extracting RGB values from works by Vincent van Gogh, Gustav Klimt, and Mark Rothko, this project employs clustering techniques to determine whether machine learning can correctly group paintings by artist based solely on color.
For this project, we assume that each painter has a distinct and consistent color palette. This assumption allows us to analyze whether color alone is sufficient to classify paintings correctly.
library(cluster)
library(factoextra)
library(imager)
library(jpeg)
library(rasterImage)
library(dplyr)
library(recolorize)
For this project, I created a custom dataset. For each artist — Van Gogh, Klimt, and Rothko — I selected four paintings with similar color schemes, resulting in a total of 12 paintings.
Training set: 9 paintings (3 per artist) used to train the model.
Test set: 3 paintings (1 per artist) used to evaluate the model on unseen data.
The first step was to organize the paintings by artist and split them into training and test sets.
rothko_paintings <- list.files(pattern = "Rothko_.*\\.jpg", full.names = TRUE)
klimt_paintings <- list.files(pattern = "Klimt_.*\\.jpg", full.names = TRUE)
vangogh_paintings <- list.files(pattern = "VanGogh_.*\\.jpg", full.names = TRUE)
training_set <- c(rothko_paintings[1:3], klimt_paintings[1:3], vangogh_paintings[1:3])
testing_set <- c(rothko_paintings[4], klimt_paintings[4], vangogh_paintings[4])
Next, I created a list of training images and used the recolorize function to extract and plot their color palettes. This visualization provided a clear representation of the dominant colors in each artist’s work.
image_paths <- c("VanGogh_1.jpg", "VanGogh_2.jpg", "VanGogh_3.jpg",
"Rothko_1.jpg", "Rothko_2.jpg", "Rothko_3.jpg",
"Klimt_1.jpg", "Klimt_2.jpg", "Klimt_3.jpg")
par(mfrow = c(3, 3), mar = c(1, 1, 1, 1))
for (img_idx in seq_along(image_paths)) {
palette <- recolorize(image_paths[img_idx], method = "k", n = 4, plotting = FALSE)
plotColorPalette(palette$centers)
}
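To make the palette step less of a black box: the k-means option in recolorize groups pixel colors into n clusters and reports the cluster centers as the palette. The sketch below mimics that idea with base R's kmeans() on synthetic pixels. The four base colors and the noise level are made up for illustration, and recolorize's internals (for example its choice of color space) may differ.

```r
set.seed(42)

# Four made-up base colours (rows are R, G, B in [0, 1])
base_colours <- rbind(
  c(0.9, 0.8, 0.1),
  c(0.1, 0.2, 0.7),
  c(0.8, 0.1, 0.1),
  c(0.1, 0.6, 0.2)
)

# 50 noisy "pixels" around each base colour, clipped back into [0, 1]
pixels <- base_colours[rep(1:4, each = 50), ] +
  matrix(rnorm(600, sd = 0.02), ncol = 3)
pixels <- pmin(pmax(pixels, 0), 1)

# Cluster the pixels; the 4 cluster centres act as the palette
fit <- kmeans(pixels, centers = 4, nstart = 25)
palette_hex <- rgb(fit$centers[, 1], fit$centers[, 2], fit$centers[, 3])
palette_hex
```

Each entry of palette_hex is one palette color; with n = 4, every painting is summarized by four such colors.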
As expected, each painter exhibits a distinct color palette.
To better understand the dataset, I plotted one of the images and checked its dimensions.
image1 <- readJPEG("VanGogh_1.jpg")
plot(1, type="n", main="Van Gogh painting 1")
rasterImage(image1, 0.6, 0.6, 1.4, 1.4)
dim1 <- dim(image1)
dim1
## [1] 200 200 3
Each painting is 200×200 pixels with three color layers (RGB). Since clustering needs numerical data, I extracted the RGB values of every pixel from each image and then dropped the source labels so that only the three color channels are passed to the algorithm.
# Function to extract RGB data from a list of image files - generated with the help of AI
get_rgb_data <- function(image_files) {
  do.call(rbind, lapply(image_files, function(file_name) {
    image <- readJPEG(file_name)
    data.frame(
      r.value = as.vector(image[, , 1]),
      g.value = as.vector(image[, , 2]),
      b.value = as.vector(image[, , 3]),
      source = basename(file_name)
    )
  }))
}
training_rgb <- get_rgb_data(training_set)
testing_rgb <- get_rgb_data(testing_set)
training_rgb_no_source <- training_rgb[, c("r.value", "g.value", "b.value")]
testing_rgb_no_source <- testing_rgb[, c("r.value", "g.value", "b.value")]
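As a small illustration of what this flattening does, the sketch below applies the same as.vector() step to a made-up 2×2 "image". The pixel values are invented, and the array only imitates the height × width × channel layout that readJPEG() returns.

```r
# A made-up 2 x 2 image with 3 channels, values in [0, 1]
img <- array(0, dim = c(2, 2, 3))
img[, , 1] <- matrix(c(1.0, 0.8, 0.6, 0.4), nrow = 2)  # red channel
img[, , 2] <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)  # green channel
img[, , 3] <- 0                                        # blue channel

# One row per pixel, one column per channel
pixels <- data.frame(
  r.value = as.vector(img[, , 1]),
  g.value = as.vector(img[, , 2]),
  b.value = as.vector(img[, , 3])
)

nrow(pixels)   # 2 * 2 = 4 rows; a 200 x 200 painting gives 40,000
pixels[3, ]    # as.vector() walks down columns, so row 3 is img[1, 2, ]
```

The spatial position of each pixel is discarded at this point: the clustering only ever sees a bag of colors per painting.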
CLARA (Clustering Large Applications) is a clustering algorithm designed for large datasets. While my dataset consists of only 12 paintings, each painting is 200×200 pixels, so each one contributes 40,000 data points, and the nine training paintings together yield 360,000 rows.
CLARA extends k-medoids to large datasets by running it on subsamples of the data. Like k-medoids, and unlike k-means, it selects actual data points (medoids) as cluster centers, so every cluster center is a color that really occurs in one of the paintings.
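The difference between medoids and means can be checked directly. The sketch below runs clara() and kmeans() on synthetic "pixels" (two made-up color groups, not the paintings) and tests whether each cluster center is an actual row of the data.

```r
library(cluster)
set.seed(1)

# Synthetic pixels: one dark and one bright colour group (made-up values)
pixels <- rbind(
  matrix(runif(300, 0.0, 0.3), ncol = 3),
  matrix(runif(300, 0.7, 1.0), ncol = 3)
)
colnames(pixels) <- c("r", "g", "b")

cl <- clara(pixels, k = 2, samples = 5)
km <- kmeans(pixels, centers = 2)

# CLARA: i.med holds the row indices of the medoids in the data
medoid_is_pixel <- all(cl$medoids == pixels[cl$i.med, ])

# k-means: is any centre exactly equal to some row of the data?
center_is_pixel <- any(apply(km$centers, 1, function(ctr) {
  any(apply(pixels, 1, function(p) all(p == ctr)))
}))

medoid_is_pixel
center_is_pixel
```

Here the medoids are rows of pixels, while the k-means centers are averages that need not match any single pixel.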
I applied CLARA to the training set, specifying 3 clusters (one for each artist). The algorithm assigned each pixel in the dataset to a cluster, and each painting was then assigned to its most frequent cluster.
clara_model <- clara(training_rgb_no_source, k = 3, metric = "euclidean", samples = 3)
training_rgb$cluster <- clara_model$clustering
image_clusters <- training_rgb %>%
group_by(source) %>%
summarize(cluster = names(which.max(table(cluster))))
image_clusters
## # A tibble: 9 × 2
## source cluster
## <chr> <chr>
## 1 Klimt_1.jpg 2
## 2 Klimt_2.jpg 2
## 3 Klimt_3.jpg 2
## 4 Rothko_1.jpg 1
## 5 Rothko_2.jpg 1
## 6 Rothko_3.jpg 1
## 7 VanGogh_1.jpg 3
## 8 VanGogh_2.jpg 3
## 9 VanGogh_3.jpg 3
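The per-painting majority vote can be sketched on a toy example. The file names and cluster labels below are made up, and base R's split() and vapply() stand in for the dplyr pipeline. The sketch also reports the share of pixels behind each winning cluster, which indicates how decisive the vote was.

```r
# Made-up pixel-level cluster labels for two hypothetical images
pixel_clusters <- data.frame(
  source  = rep(c("A.jpg", "B.jpg"), each = 5),
  cluster = c(1, 1, 1, 2, 3,
              2, 2, 3, 2, 1)
)

by_image <- split(pixel_clusters$cluster, pixel_clusters$source)

# Most frequent cluster per image (same which.max(table()) idea as above)
majority <- vapply(by_image,
                   function(cl) names(which.max(table(cl))),
                   character(1))

# Fraction of the image's pixels that fall in the winning cluster
vote_share <- vapply(by_image,
                     function(cl) max(table(cl)) / length(cl),
                     numeric(1))

majority
vote_share
```

A vote share close to 1 means nearly all pixels agree; a share only slightly above 1/3 (with three clusters) would mean the painting's label rests on a thin margin.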
We can plot the result to see all training paintings in their clusters.
par(mfrow = c(3, 3), mar = c(5,5,1,1))
for (i in 1:nrow(image_clusters)) {
img <- readJPEG(image_clusters$source[i])
plot(1, type = "n", main = paste("Cluster:", image_clusters$cluster[i]),
xlab = "", ylab = "")
rasterImage(img, 0.6, 0.6, 1.4, 1.4)
}
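When plotted, the results showed that all three Rothko paintings fell into Cluster 1, all three Klimt paintings into Cluster 2, and all three Van Gogh paintings into Cluster 3.
This confirmed that CLARA grouped the training paintings by artist based on their dominant colors.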
Next, I prepared the test data: one unseen painting per artist.
testing_images_paths <- c("VanGogh_4.jpg","Rothko_4.jpg", "Klimt_4.jpg")
par(mfrow = c(1, 3), mar = c(1, 1, 1, 1))
for (i in 1:length(testing_images_paths)) {
img <- readJPEG(testing_images_paths[i])
plot(1, type = "n", main = paste("Image:", testing_images_paths[i]),
xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')
rasterImage(img, 0.6, 0.6, 1.4, 1.4)
}
Again, we observe the same distinct color schemes.
To simplify the classification process, I calculated the mean RGB value for each test painting.
testing_mean_rgb <- testing_rgb %>%
group_by(source) %>%
summarise(
mean_r = mean(r.value),
mean_g = mean(g.value),
mean_b = mean(b.value)
)
testing_mean_rgb
## # A tibble: 3 × 4
## source mean_r mean_g mean_b
## <chr> <dbl> <dbl> <dbl>
## 1 Klimt_4.jpg 0.709 0.460 0.211
## 2 Rothko_4.jpg 0.692 0.176 0.0848
## 3 VanGogh_4.jpg 0.398 0.521 0.546
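To get a feel for what these averages look like, the reported means can be turned back into colors. The sketch below re-enters the three rows from the table above by hand (rounded as printed) and converts them to hex codes with rgb(); the barplot is only one possible way to display them as swatches.

```r
# Mean RGB values copied from the printed table (rounded as shown there)
test_means <- data.frame(
  source = c("Klimt_4.jpg", "Rothko_4.jpg", "VanGogh_4.jpg"),
  mean_r = c(0.709, 0.692, 0.398),
  mean_g = c(0.460, 0.176, 0.521),
  mean_b = c(0.211, 0.0848, 0.546)
)

# Convert each mean colour to a hex code (inputs are already in [0, 1])
test_means$hex <- rgb(test_means$mean_r, test_means$mean_g, test_means$mean_b)
test_means[, c("source", "hex")]

# One swatch per test painting
barplot(rep(1, nrow(test_means)), col = test_means$hex,
        names.arg = test_means$source, axes = FALSE)
```

The red and green channels separate the three paintings most clearly: Rothko's mean has very little green, while Van Gogh's is the only one where blue exceeds red.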
I initially considered using the predict() function from the flexclust package. However, it requires converting the model to the kcca class, which was not compatible with my data. Instead, I implemented a custom classification step: compute the Euclidean distance between each test painting’s mean RGB values and each cluster medoid returned by CLARA, then assign the painting to the nearest medoid.
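# CLARA returns medoids (actual pixels) rather than centroids; the variable name is kept as is
cluster_centroids <- clara_model$medoids
#generated with help of AI
# For each test painting, find the index of the nearest medoid (Euclidean distance)
predicted_clusters <- sapply(seq_len(nrow(testing_mean_rgb)), function(i) {
  # unlist() turns the one-row tibble into a plain numeric vector
  test_rgb <- unlist(testing_mean_rgb[i, c("mean_r", "mean_g", "mean_b")])
  distances <- sapply(seq_len(nrow(cluster_centroids)), function(j) {
    sqrt(sum((test_rgb - cluster_centroids[j, ])^2))
  })
  which.min(distances)
})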
Let’s see what we got from the calculations:
testing_mean_rgb$predicted_cluster <- predicted_clusters
testing_mean_rgb
## # A tibble: 3 × 5
## source mean_r mean_g mean_b predicted_cluster
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Klimt_4.jpg 0.709 0.460 0.211 2
## 2 Rothko_4.jpg 0.692 0.176 0.0848 1
## 3 VanGogh_4.jpg 0.398 0.521 0.546 3
Each test painting was assigned to its artist’s cluster: Rothko to Cluster 1, Klimt to Cluster 2, and Van Gogh to Cluster 3.
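The nearest-medoid rule itself is small enough to sketch as a standalone helper. In the example below, nearest_medoid() is a hypothetical function name and the medoid values are invented, since the actual medoids are not printed in this report. The test points are the mean RGB values from the table above.

```r
# Hypothetical helper: index of the closest medoid for each row of `points`
nearest_medoid <- function(points, medoids) {
  apply(points, 1, function(p) {
    # Subtract p from every medoid row, then take row-wise Euclidean norms
    d <- sqrt(rowSums(sweep(medoids, 2, p)^2))
    unname(which.min(d))
  })
}

# Invented medoids for clusters 1-3 (NOT the values from clara_model)
medoids <- rbind(
  c(0.70, 0.15, 0.08),
  c(0.75, 0.50, 0.20),
  c(0.35, 0.50, 0.55)
)

# Mean RGB of the three test paintings, copied from the printed table
points <- rbind(
  "Klimt_4.jpg"   = c(0.709, 0.460, 0.211),
  "Rothko_4.jpg"  = c(0.692, 0.176, 0.0848),
  "VanGogh_4.jpg" = c(0.398, 0.521, 0.546)
)

predicted <- nearest_medoid(points, medoids)
predicted
```

With these invented medoids, the helper assigns Klimt to 2, Rothko to 1, and Van Gogh to 3, the same pattern as in the table above. Comparing a mean color against medoids is a simplification: a painting with two very different dominant colors can have a mean that resembles neither.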
Lastly, I combined the training and test datasets and plotted all paintings together to visualize the final clusters.
testing_clusters <- testing_mean_rgb %>%
select(source, predicted_cluster) %>%
rename(cluster = predicted_cluster)
image_clusters$cluster <- as.character(image_clusters$cluster)
testing_clusters$cluster <- as.character(testing_clusters$cluster)
all_image_clusters <- bind_rows(image_clusters, testing_clusters)
all_image_clusters <- all_image_clusters %>%
arrange(cluster)
par(mfrow = c(3, 4), mar = c(5, 5, 1, 1))
for (i in 1:nrow(all_image_clusters)) {
img <- readJPEG(all_image_clusters$source[i])
plot(1, type = "n", main = paste("Cluster:", all_image_clusters$cluster[i]),
xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')
rasterImage(img, 0.6, 0.6, 1.4, 1.4)
}
This study explored whether paintings could be classified based solely on their color composition using unsupervised learning methods. By extracting RGB values from artworks by Vincent van Gogh, Gustav Klimt, and Mark Rothko, we applied the CLARA clustering algorithm to identify dominant color patterns. The results showed that the model successfully grouped paintings into clusters that corresponded to the original artists, suggesting that color alone can serve as a distinguishing feature in artistic classification.
However, it is important to note that this dataset was carefully curated: for each artist, I selected paintings that share a single dominant color palette. In reality, artists often experiment with diverse color schemes. For example, if we added Van Gogh’s “Sunflowers”, the model would likely fail to place it with the other Van Gogh paintings, as it is predominantly yellow rather than blue-green.