This study examines how satellite images can be grouped without labels. Using the EuroSAT dataset, I drew a random sample of 2,000 images, reduced each one to grayscale pixel brightness, and clustered them with the CLARA algorithm. I then compared two ways of simplifying the data: PCA, which is linear, and t-SNE, which is non-linear and therefore more flexible. My results show that t-SNE reveals the curved, continuous structure of the land-use data far more clearly in visualization, although both reductions preserved the original clustering to a similar degree.
First, I need to locate all the image files. The EuroSAT dataset contains many satellite photos of different land-use types. I also set a random seed here, so that my random choices stay the same every time the code is rerun.
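For completeness, the chunks below assume a handful of packages are already loaded. This is my reconstruction of the setup chunk (jpeg for image reading, factoextra for the clustering helpers, Rtsne for t-SNE, flexclust for the Rand index, ggplot2 and scales for plotting):

library(jpeg)        # readJPEG()
library(factoextra)  # get_clust_tendency(), eclust(), fviz_*() helpers
library(Rtsne)       # Rtsne()
library(flexclust)   # randIndex()
library(ggplot2)     # plots
library(scales)      # percent axis labels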
# Collect every JPEG path in the dataset folder, searching subfolders too
euroSAT <- list.files("EuroSAT_RGB", pattern = "\\.jpg$", full.names = TRUE, recursive = TRUE)
# Fix the random number generator so the sample below is reproducible
set.seed(123)
Processing every single image would take too long, so I picked 2,000 images at random and converted them to grayscale by averaging the Red, Green, and Blue channels into a single brightness value.
After that, I flattened each image into a single numeric vector and scaled the data. Scaling standardizes every pixel position so that consistently bright pixels do not dominate the distance calculations.
sample_files <- sample(euroSAT, 2000)
process_image <- function(filepath) {
  img <- readJPEG(filepath)
  # Convert RGB to grayscale by averaging the three channels
  if (length(dim(img)) == 3) {
    img <- (img[,,1] + img[,,2] + img[,,3]) / 3
  }
  as.vector(img)  # flatten the 64x64 pixel grid into one long vector
}
# One row per image, one column per pixel
image_matrix <- t(sapply(sample_files, process_image))
df <- as.data.frame(image_matrix)
# Each image's class label is the name of its parent folder
true_labels <- basename(dirname(sample_files))
table(true_labels)
## true_labels
## AnnualCrop Forest HerbaceousVegetation
## 210 228 217
## Highway Industrial Pasture
## 207 152 159
## PermanentCrop Residential River
## 181 232 172
## SeaLake
## 242
# Standardize each pixel column to mean 0, standard deviation 1
df_scaled <- scale(df)
dim(df_scaled)
## [1] 2000 4096
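As a quick sanity check (illustrative, not part of the original analysis), every pixel column of the scaled matrix should now have mean 0 and standard deviation 1:

# The first column should be standardized like all the others
round(mean(df_scaled[, 1]), 6)
round(sd(df_scaled[, 1]), 6)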
Before clustering, I should check whether the data actually contains groups at all; sometimes data is just random noise. I used the Hopkins statistic for this: values close to 1 indicate well-defined clusters. As we can see below, the data is highly clusterable.
get_clust_tendency(df_scaled, n = 50, graph = FALSE)$hopkins_stat
## [1] 0.8141851
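For reference, the statistic compares nearest-neighbor distances of uniformly generated points ($u_i$) with those of real data points ($w_i$). As I understand the implementation, it reports

$$H = \frac{\sum_{i=1}^{n} u_i}{\sum_{i=1}^{n} u_i + \sum_{i=1}^{n} w_i},$$

so a value near 0.5 indicates uniform noise, while values approaching 1 indicate a strong clustering tendency. Here $H \approx 0.81$.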
I have a lot of data, so standard clustering methods are too slow. Instead, I used CLARA, a faster variant of the k-medoids (PAM) method that repeatedly clusters small subsamples of the data and keeps the best set of medoids.
First, I checked how many clusters to look for. The Silhouette Method suggested that 2 clusters was the best choice, so I ran the algorithm to split the images into two groups.
# Estimate k on a 500-image subset to keep the silhouette search fast
fviz_nbclust(df_scaled[1:500, ], clara, method = "silhouette") +
  labs(subtitle = "Silhouette Method")
# Partition all 2,000 images into k = 2 groups and inspect the result
clara_flex <- eclust(df_scaled, "clara", k = 2, graph = FALSE)
fviz_cluster(clara_flex, geom = "point")
fviz_silhouette(clara_flex)
## cluster size ave.sil.width
## 1 1 1035 0.12
## 2 2 965 0.59
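For reference, the same model can be fit directly with cluster::clara, which makes the subsampling explicit. This is only a sketch; the samples and sampsize values below are illustrative choices, not tuned ones:

library(cluster)
# CLARA draws `samples` random subsets of `sampsize` rows, runs PAM on each,
# and keeps the medoids with the lowest total dissimilarity over all data
clara_direct <- clara(df_scaled, k = 2, metric = "euclidean",
                      samples = 5, sampsize = 200)
table(clara_direct$clustering)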
Now I want to see what is inside my clusters by comparing the computer-generated groups with the real labels (Forest, Highway, River, and so on). This is often called a purity analysis.
The plot below shows the mix. Cluster 1 is messy, containing most of the man-made scenes such as highways, industrial areas, and residential areas; its low average silhouette width (0.12) reflects this. Cluster 2 is much cleaner and mostly contains natural scenes such as water and forest, matching its higher silhouette width (0.59).
# Cross-tabulate cluster assignments against the true land-use labels
conf_mat <- table(Cluster = clara_flex$clustering, Truth = true_labels)
conf_df <- as.data.frame(conf_mat)
land_use_colors <- c(
"Forest" = "forestgreen",
"SeaLake" = "dodgerblue3",
"Industrial" = "firebrick",
"Highway" = "darkgrey",
"River" = "blue",
"Pasture" = "lawngreen",
"Residential" = "purple",
"AnnualCrop" = "gold",
"HerbaceousVegetation" = "yellowgreen",
"PermanentCrop" = "chartreuse4"
)
ggplot(conf_df, aes(x = factor(Cluster), y = Freq, fill = Truth)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Cluster Purity Analysis",
x = "Cluster ID",
y = "Proportion of Land Use Types",
fill = "True Label") +
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = land_use_colors)
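The plot can also be summarized as a single number. As a minimal sketch (not part of the original analysis), overall purity counts each image as "correct" if it carries its cluster's most common true label:

# Purity: fraction of images matching their cluster's majority label
purity <- sum(apply(conf_mat, 1, max)) / sum(conf_mat)
purity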
## Dimension Reduction - PCA

It is hard to visualize data with thousands of dimensions. PCA squashes it down to two by finding the directions along which the data varies the most.
The plot below shows the result. The two groups appear as large, heavily overlapping blobs. This is expected: PCA is a linear method, so it can only rotate and project the data, and it cannot unfold cluster structure that lies on a curved surface.
pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)  # data is already centered and scaled
fviz_eig(pca_res, addlabels = TRUE, main = "PCA: Variance Explained")
fviz_pca_ind(pca_res,
geom.ind = "point",
pointsize = 1,
col.ind = as.factor(clara_flex$clustering),
palette = "jco",
addEllipses = TRUE,
legend.title = "Cluster",
title = "PCA Result (Linear Reduction)")
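The scree plot can also be read numerically. As a quick illustration (not in the original analysis), the cumulative share of variance captured by the first two components can be computed directly from the prcomp object:

# Proportion of total variance captured by each principal component
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)
cumsum(var_explained)[1:2]   # cumulative share for PC1 and PC2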
## Dimension Reduction - t-SNE

Since PCA struggled, I tried t-SNE, a non-linear method. Instead of global variance, it preserves the local "neighbors" of each point, which makes it much better at unfolding complex shapes.
The result is very different. You can see a curved shape, almost like a horseshoe. This shows that the land types flow into each other rather than sitting in separate boxes.
# t-SNE requires distinct rows; keep an index so cluster labels stay aligned
dup <- duplicated(df_scaled)
df_unique <- df_scaled[!dup, ]
tsne_out <- Rtsne(df_unique, dims = 2, pca = TRUE, perplexity = 30, verbose = FALSE)
tsne_plot <- data.frame(x = tsne_out$Y[, 1],
                        y = tsne_out$Y[, 2],
                        Cluster = as.factor(clara_flex$clustering[!dup]))
ggplot(tsne_plot, aes(x = x, y = y, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = "t-SNE Visualization (Non-Linear)",
x = "t-SNE Dimension 1",
y = "t-SNE Dimension 2") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
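One way to sanity-check the embedding itself is the Kullback-Leibler divergence that Rtsne records during optimization; lower final values mean the 2-D layout matches the high-dimensional neighborhoods more faithfully. This check is illustrative and was not part of the original pipeline:

# Final KL divergence of the embedding (Rtsne records it every 50 iterations)
tail(tsne_out$itercosts, 1)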
## Comparing the Scores

Finally, I quantified the comparison. I ran CLARA again on the reduced data from PCA (using only the first principal component) and from t-SNE (using both embedding dimensions), then compared each new clustering to the original one using the Rand Index. A higher score means the reduced representation preserved more of the original cluster structure.
As the table below shows, the two scores are nearly identical, with PCA marginally ahead. So while t-SNE was far more informative visually, neither reduction clearly preserved more of the original clustering than the other.
# Cluster the reduced representations and compare them to the original
# clustering with the (unadjusted) Rand index
pca_data_1 <- pca_res$x[, 1, drop = FALSE]   # first principal component only
clara_pca <- eclust(pca_data_1, "clara", k = 2, graph = FALSE)
tsne_data <- data.frame(x = tsne_out$Y[, 1], y = tsne_out$Y[, 2])
clara_tsne <- eclust(tsne_data, "clara", k = 2, graph = FALSE)
randtSne <- randIndex(clara_flex$clustering[!dup], clara_tsne$clustering, correct = FALSE)
randPca <- randIndex(clara_flex$clustering, clara_pca$clustering, correct = FALSE)
comparison_metrics <- data.frame(
  Method = c("t-SNE", "PCA"),
  Score = c(randtSne, randPca)
)
print(comparison_metrics)
##   Method     Score
## 1  t-SNE 0.7811406
## 2    PCA 0.7818914
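A natural follow-up (sketched here, not run in the original analysis) is the adjusted Rand index, which corrects for chance agreement and tends to separate methods more sharply than the raw index:

# Adjusted Rand index: 0 = chance-level agreement, 1 = identical partitions
randIndex(clara_flex$clustering[!dup], clara_tsne$clustering, correct = TRUE)
randIndex(clara_flex$clustering, clara_pca$clustering, correct = TRUE)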