This project aims to compute the percentage of the image occupied by each of the main colors in a picture from the football match Juventus - Real Madrid. After a data reduction step, made necessary by the considerable size of the file, the process focuses on the K-Means and CLARA clustering algorithms, which are crucial for achieving the final goal.

  1. Libraries

After installing the appropriate packages, the libraries providing the functions used throughout the analysis must be loaded.

library(png)
library(cluster)
library(ClusterR)
library(Rtsne)
library(grid)
  2. Data Loading and Preprocessing

The first step was to load the raw image and convert it into a data frame where every row represents a pixel with its Red, Green, and Blue (RGB) values. To speed up the computation, a uniform downsampling was necessary.

img <- readPNG("C:/Users/milan/Downloads/USL/clusteringimage.png")
grid.raster(img)

# uniform downsampling (keep every other row and column): the full image is too large to process
imgRGB <- data.frame(
  R = as.vector(img[seq(1,dim(img)[1],by=2), seq(1,dim(img)[2],by=2), 1]),
  G = as.vector(img[seq(1,dim(img)[1],by=2), seq(1,dim(img)[2],by=2), 2]),
  B = as.vector(img[seq(1,dim(img)[1],by=2), seq(1,dim(img)[2],by=2), 3])
)
total_pixels <- nrow(imgRGB)
cat("Total pixels analyzed:", total_pixels, "\n\n")
## Total pixels analyzed: 85500
imgRGB_data <- imgRGB
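
As a quick sanity check, keeping every other row and column should retain roughly one quarter of the pixels (the variable full_pixels below is only illustrative, not part of the analysis proper):

# the 2x downsampling in each dimension keeps about 1/4 of the original pixels
full_pixels <- dim(img)[1] * dim(img)[2]
cat("Full-resolution pixels:", full_pixels, "- retained:", total_pixels, "\n")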
  3. Selection of the Ideal Number of Clusters

In this step the most appropriate number of clusters (k) for the data under analysis is determined by studying the silhouette of each candidate. Since calculating distance matrices for the entire high-resolution image is computationally too expensive, a random representative sample of 5,000 pixels is used instead. The code then iterates through potential k values ranging from 2 to 10, calculating the average silhouette width for each configuration, and the value with the highest score is selected as the optimal parameter for the division. The optimal number of clusters obtained is 4, with a silhouette width of 0.6881.
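
A rough back-of-the-envelope estimate makes the cost concrete: dist() would have to store one value for every pair of pixels.

# dist() stores n*(n-1)/2 pairwise distances as 8-byte doubles
n_dist <- total_pixels * (total_pixels - 1) / 2
cat(sprintf("Full distance matrix: %.2e distances, ~%.0f GB of memory\n",
            n_dist, n_dist * 8 / 1e9))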

set.seed(42)
sample_idx <- sample(1:nrow(imgRGB_data), 5000)
imgRGB_sample <- imgRGB_data[sample_idx, ]
silhouette_results <- data.frame()

for (k in 2:10) {
  km_temp <- kmeans(imgRGB_sample, centers=k, nstart=10, iter.max=100)
  sil <- mean(silhouette(km_temp$cluster, dist(imgRGB_sample))[, 3])
  silhouette_results <- rbind(silhouette_results, data.frame(k=k, sil=sil))
  cat(sprintf("k=%d: Silhouette=%.4f\n", k, sil))
}
## k=2: Silhouette=0.6396
## k=3: Silhouette=0.6489
## k=4: Silhouette=0.6881
## k=5: Silhouette=0.6729
## k=6: Silhouette=0.6458
## k=7: Silhouette=0.6447
## k=8: Silhouette=0.6130
## k=9: Silhouette=0.4220
## k=10: Silhouette=0.4179
optimal_k <- silhouette_results$k[which.max(silhouette_results$sil)]
cat("Optimal k found:", optimal_k, "\n\n")
## Optimal k found: 4
  4. Dimensionality Reduction

The RGB data can now be dimensionally reduced to obtain a satisfactory visualization, and for this the PCA algorithm is used. The results show that the first principal component (PC1) dominates the structure, explaining 74.8% of the variance. Combined with the 20.7% of the second component, these two dimensions account for over 95% of the total information. This suggests that the image can be effectively projected from 3D to 2D with almost no loss of visual detail, likely because the dominant colors (such as the green pitch) create a strong correlation between the RGB channels.

cat("\n PCA \n")
## 
##  PCA
pca <- prcomp(imgRGB_data, center=TRUE, scale.=FALSE)
var_exp <- (pca$sdev^2 / sum(pca$sdev^2)) * 100

cat(sprintf("PC1: %.1f%% variance\n", var_exp[1]))
## PC1: 74.8% variance
cat(sprintf("PC2: %.1f%% variance\n", var_exp[2]))
## PC2: 20.7% variance
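
As a quick confirmation of the claim above, the cumulative share of the first two components can be printed directly:

# cumulative variance captured by the first two principal components
cat(sprintf("PC1 + PC2: %.1f%% cumulative variance\n", sum(var_exp[1:2])))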

Next, t-SNE is used to detect non-linear relationships in the color space. First, duplicate pixel colors are removed from the dataset, since Rtsne() refuses duplicate rows by default and they would slow down the process in any case. Once the data is cleaned, the code maps the unique colors into a two-dimensional space. A safety mechanism (the 'tryCatch' block) is included to handle potential errors, ensuring the rest of the analysis continues even if the computation fails for any reason. The goal is to group similar colors based on their relationships, which should reveal distinct clusters better than the standard linear method.

cat("\n t-SNE \n")
## 
##  t-SNE
unique_indices <- which(!duplicated(imgRGB_data))
imgRGB_unique <- imgRGB_data[unique_indices, ]
tsne_success <- FALSE

tryCatch({
  tsne_result <- Rtsne(imgRGB_unique, dims=2, perplexity=30, verbose=FALSE, max_iter=1000)
  tsne_success <- TRUE
}, error = function(e) {
  cat("t-SNE error \n")
})
  5. Clustering

At this point the clustering itself starts, comparing two methods, K-Means and CLARA, both run with the optimal number of clusters determined earlier. The first uses 25 different random starting points, considered a good compromise between execution time and quality. The quality of the resulting segmentation is evaluated as the mean silhouette score on a sample subset of the data: the closer this value is to 1, the better.
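
For reference, the silhouette of a point i compares its average distance to its own cluster, a(i), with its average distance to the nearest other cluster, b(i), as s(i) = (b(i) - a(i)) / max(a(i), b(i)). A minimal toy illustration of the score (the data below is made up purely for demonstration):

# two tight, well-separated 1-D clusters give silhouettes close to 1
toy <- matrix(c(0.0, 0.1, 5.0, 5.2), ncol = 1)
toy_sil <- silhouette(c(1, 1, 2, 2), dist(toy))
mean(toy_sil[, 3])   # about 0.97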

The second scales better to large datasets than K-Means, as it works by clustering small random samples of the data to estimate the best medoids for the image, rather than processing every pixel at once. After the grouping phase, the code validates the quality using the same silhouette score as before, allowing a direct comparison between the two methods.
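
For completeness, clara() also exposes a sampsize argument controlling how large each random sample is; the sketch below is purely illustrative, with arbitrary parameter values (the actual call further down only sets samples):

# illustrative only: more and larger samples trade speed for stability
clara_demo <- clara(imgRGB_data, k = 4, samples = 50, sampsize = 200)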

The two results differ only slightly, on the order of 10^-3, but they show that K-Means is marginally more effective.

#clustering with k-means

k <- optimal_k
km <- kmeans(imgRGB_data, centers=k, nstart=25, iter.max=300)

km_sample_clusters <- km$cluster[sample_idx]
km_sil <- mean(silhouette(km_sample_clusters, dist(imgRGB_sample))[, 3])


# clustering with CLARA

clara <- clara(imgRGB_data, k=k, samples=50)

clara_sample_clusters <- clara$clustering[sample_idx]
clara_sil <- mean(silhouette(clara_sample_clusters, dist(imgRGB_sample))[, 3])

if (km_sil >= clara_sil) {
  best_algo <- "K-Means"
  centers <- km$centers
  clusters <- km$cluster
} else {
  best_algo <- "CLARA"
  centers <- clara$medoids
  clusters <- clara$clustering
}

cat(sprintf("Best algorithm: %s with a silhouette of %.4f)\n\n", best_algo, max(km_sil, clara_sil)))
## Best algorithm: K-Means with a silhouette of 0.6885)
  6. Color Interpretation

The final phase of the code focuses on the interpretation and visualization of the results. It goes through the center of each identified cluster and looks at its specific mix of Red, Green, and Blue values. The strategy is to first evaluate the color intensity of each cluster center, and then apply a set of simple rules to classify it. By looking for specific 'visual signs', such as the green channel being clearly stronger than the others (the football pitch) or the total brightness being very low, the code automatically gives each group a clear name, like "GREEN", "YELLOW", or "BLACK", which are clearly the dominant colors of the image. After this classification, the script calculates exactly how much of the image belongs to each color, and finally links the text labels to the actual screen colors so that the results display correctly.

colors <- rep("", k)

for (i in 1:k) {
  r <- centers[i, 1]
  g <- centers[i, 2]
  b <- centers[i, 3]
  
  dominant_channel <- which.max(c(r, g, b))
  max_val <- max(r, g, b)
  min_val <- min(r, g, b)
  
  if (dominant_channel == 2 && g > 0.35 && (g - max(r, b)) > 0.05) {
    colors[i] <- "GREEN"
  } else if (r > 0.6 && g > 0.45 && b < 0.3) {
    colors[i] <- "YELLOW"
  } else if (max_val < 0.35) {
    colors[i] <- "BLACK"
  } else if (min_val > 0.6 && (max_val - min_val) < 0.1) {
    colors[i] <- "GRAY"
  } else if (min_val > 0.85) {
    colors[i] <- "WHITE"
  } else {
    colors[i] <- "OTHER"
  }
}

cluster_counts <- table(clusters)
pct <- (as.numeric(cluster_counts)/total_pixels)*100
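
The same rules could also be wrapped in a small reusable helper. The sketch below merely restates the thresholds above; the function name classify_center and the example center are hypothetical:

# hypothetical helper restating the classification rules used above
classify_center <- function(r, g, b) {
  max_val <- max(r, g, b)
  min_val <- min(r, g, b)
  if (which.max(c(r, g, b)) == 2 && g > 0.35 && (g - max(r, b)) > 0.05) "GREEN"
  else if (r > 0.6 && g > 0.45 && b < 0.3) "YELLOW"
  else if (max_val < 0.35) "BLACK"
  else if (min_val > 0.6 && (max_val - min_val) < 0.1) "GRAY"
  else if (min_val > 0.85) "WHITE"
  else "OTHER"
}
classify_center(0.30, 0.55, 0.25)   # "GREEN": green channel clearly dominant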
  7. Visualization

This section produces the actual plots. The first two refer to the silhouette analysis and confirm that 4 is the best number of clusters for the data. The third and fourth show the PCA and t-SNE maps, in which the separation between the colors is clearly visible. In the t-SNE plot, the black pixels form a dense, isolated cloud at the top (with just two small dots in the centre), completely separated from the gray region at the bottom. The green and yellow clusters also hold their own distinct regions, despite being quite close to each other. The PCA projection shows a similar result, with the colors spread out as follows: black to the right, yellow and gray at the top and bottom, and green in the middle. The last plot is a pie chart, summarizing the entire analysis with a visualization of the precise percentage of the image occupied by each color.

par(mfrow=c(1,1), mar=c(5,4,4,2))

palette <- c("GREEN" = "#228B22","YELLOW" = "#FFD700","BLACK" = "#000000", "GRAY" = "#808080","WHITE" = "#FFFFFF","OTHER" = "#A9A9A9")

# Plot 1
plot(silhouette_results$k, silhouette_results$sil, type="b",main="Silhouette Analysis for Optimal K",xlab="Number of Clusters (K)", ylab="Average Silhouette Value",col="steelblue", pch=19, lwd=2)
abline(v=optimal_k, col="red", lty=2, lwd=2)

# Plot 2

sil_detailed <- silhouette(km_sample_clusters, dist(imgRGB_sample))

plot(sil_detailed, main=sprintf("Silhouette Plot - %s", best_algo),col=rainbow(k), border=NA)

# Plot 3
cluster_names_pca <- colors[clusters]             
plot_colors_pca   <- palette[cluster_names_pca]   

plot(pca$x[,1], pca$x[,2],col = plot_colors_pca, pch = 19, cex = 0.6, main = sprintf("PCA Projection (%s)", best_algo), xlab = sprintf("PC1 (%.1f%%)", var_exp[1]),ylab = sprintf("PC2 (%.1f%%)", var_exp[2]))

legend("topleft",legend = paste(colors, "(Cluster", 1:k, ")"),fill   = palette[colors],pch = 19, cex = 0.8, bty = "n")

# Plot 4
if (tsne_success) {
  tsne_clusters <- clusters[unique_indices]
  cluster_names <- colors[tsne_clusters] 
  plot_colors <- palette[cluster_names]
  
  plot(tsne_result$Y[,1], tsne_result$Y[,2], pch=19, cex=0.6,col = plot_colors,main="t-SNE graph",xlab="t-SNE Dimension 1", ylab="t-SNE Dimension 2")
  
  legend("topright",legend = paste(colors), fill = palette[colors],cex=0.8)
}

# Plot 5
pie(pct, labels=colors, main="Color Distribution in Image", col=palette[colors], cex=0.8)

legend("topright", legend=paste(colors, sprintf("%.1f%%", pct)), fill=palette[colors], cex=0.9)

  8. Final Summary

The last lines of the code report the exact number of pixels associated with each color, giving more detailed information about how the clusters divide the image.

cat(sprintf("Optimal number of clusters: %d\n", optimal_k))
## Optimal number of clusters: 4
cat(sprintf("Best Algorithm: %s (Silhouette: %.4f)\n", best_algo, max(km_sil, clara_sil)))
## Best Algorithm: K-Means (Silhouette: 0.6885)
cat("\nCOLOR PERCENTAGES:\n")
## 
## COLOR PERCENTAGES:
for (i in 1:k) {
  cat(sprintf("%-10s: %6.2f%% (%7s pixels)\n", colors[i], pct[i], format(cluster_counts[i], big.mark=",")))
}
## BLACK     :  10.66% (  9,111 pixels)
## GREEN     :  74.12% ( 63,376 pixels)
## GRAY      :   7.18% (  6,135 pixels)
## YELLOW    :   8.04% (  6,878 pixels)