This study examines how satellite images can be grouped without labels. Using the EuroSAT dataset, I drew a random sample of 2,000 images, reduced each one to grayscale pixel brightness, and clustered them with the CLARA algorithm. I then compared two ways of simplifying the data: PCA, which is linear, and t-SNE, which is non-linear and therefore more flexible. My results show that t-SNE reveals the curved, continuous structure of the land-use data far more clearly in visualization, although both reductions preserved the original clustering to a similar degree.
First, I need to locate all the image files. The EuroSAT dataset contains many satellite photos of different land-use types. I also set a random seed here, so that my random choices stay the same every time the code is rerun.
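For completeness, the chunks below assume a handful of packages are already loaded. This is my reconstruction of the setup chunk (jpeg for image reading, factoextra for the clustering helpers, Rtsne for t-SNE, flexclust for the Rand index, ggplot2 and scales for plotting):

library(jpeg)        # readJPEG()
library(factoextra)  # get_clust_tendency(), eclust(), fviz_*() helpers
library(Rtsne)       # Rtsne()
library(flexclust)   # randIndex()
library(ggplot2)     # plots
library(scales)      # percent axis labels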
# Collect every JPEG path in the dataset folder, searching subfolders too
euroSAT <- list.files("EuroSAT_RGB", pattern = "\\.jpg$", full.names = TRUE, recursive = TRUE)
# Fix the random number generator so the sample below is reproducible
set.seed(123)
Processing every single image would take too long, so I picked 2,000 images at random and converted them to grayscale by averaging the Red, Green, and Blue channels into a single brightness value.
After that, I flattened each image into a single numeric vector and scaled the data. Scaling standardizes every pixel position so that consistently bright pixels do not dominate the distance calculations.
sample_files <- sample(euroSAT, 2000)
process_image <- function(filepath) {
  img <- readJPEG(filepath)
  # Convert RGB to grayscale by averaging the three channels
  if (length(dim(img)) == 3) {
    img <- (img[,,1] + img[,,2] + img[,,3]) / 3
  }
  as.vector(img)  # flatten the 64x64 pixel grid into one long vector
}
# One row per image, one column per pixel
image_matrix <- t(sapply(sample_files, process_image))
df <- as.data.frame(image_matrix)
# Each image's class label is the name of its parent folder
true_labels <- basename(dirname(sample_files))
table(true_labels)
## true_labels
## AnnualCrop Forest HerbaceousVegetation
## 210 228 217
## Highway Industrial Pasture
## 207 152 159
## PermanentCrop Residential River
## 181 232 172
## SeaLake
## 242
# Standardize each pixel column to mean 0, standard deviation 1
df_scaled <- scale(df)
dim(df_scaled)
## [1] 2000 4096
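As a quick sanity check (illustrative, not part of the original analysis), every pixel column of the scaled matrix should now have mean 0 and standard deviation 1:

# The first column should be standardized like all the others
round(mean(df_scaled[, 1]), 6)
round(sd(df_scaled[, 1]), 6)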
Before clustering, I should check whether the data actually contains groups at all; sometimes data is just random noise. I used the Hopkins statistic for this: values close to 1 indicate well-defined clusters. As we can see below, the data is highly clusterable.
get_clust_tendency(df_scaled, n = 50, graph = FALSE)$hopkins_stat
## [1] 0.8141851
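For reference, the statistic compares nearest-neighbor distances of uniformly generated points ($u_i$) with those of real data points ($w_i$). As I understand the implementation, it reports

$$H = \frac{\sum_{i=1}^{n} u_i}{\sum_{i=1}^{n} u_i + \sum_{i=1}^{n} w_i},$$

so a value near 0.5 indicates uniform noise, while values approaching 1 indicate a strong clustering tendency. Here $H \approx 0.81$.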
I have a lot of data, so standard clustering methods are too slow. Instead, I used CLARA, a faster variant of the k-medoids (PAM) method that repeatedly clusters small subsamples of the data and keeps the best set of medoids.
First, I checked how many clusters to look for. The Silhouette Method suggested that 2 clusters was the best choice, so I ran the algorithm to split the images into two groups.
# Estimate k on a 500-image subset to keep the silhouette search fast
fviz_nbclust(df_scaled[1:500, ], clara, method = "silhouette") +
  labs(subtitle = "Silhouette Method")
# Partition all 2,000 images into k = 2 groups and inspect the result
clara_flex <- eclust(df_scaled, "clara", k = 2, graph = FALSE)
fviz_cluster(clara_flex, geom = "point")
fviz_silhouette(clara_flex)
## cluster size ave.sil.width
## 1 1 1035 0.12
## 2 2 965 0.59
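For reference, the same model can be fit directly with cluster::clara, which makes the subsampling explicit. This is only a sketch; the samples and sampsize values below are illustrative choices, not tuned ones:

library(cluster)
# CLARA draws `samples` random subsets of `sampsize` rows, runs PAM on each,
# and keeps the medoids with the lowest total dissimilarity over all data
clara_direct <- clara(df_scaled, k = 2, metric = "euclidean",
                      samples = 5, sampsize = 200)
table(clara_direct$clustering)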
Now I want to see what is inside my clusters by comparing the computer-generated groups with the real labels (Forest, Highway, River, and so on). This is often called a purity analysis.
The plot below shows the mix. Cluster 1 is messy, containing most of the man-made scenes such as highways, industrial areas, and residential areas; its low average silhouette width (0.12) reflects this. Cluster 2 is much cleaner and mostly contains natural scenes such as water and forest, matching its higher silhouette width (0.59).
# Cross-tabulate cluster assignments against the true land-use labels
conf_mat <- table(Cluster = clara_flex$clustering, Truth = true_labels)
conf_df <- as.data.frame(conf_mat)
land_use_colors <- c(
"Forest" = "forestgreen",
"SeaLake" = "dodgerblue3",
"Industrial" = "firebrick",
"Highway" = "darkgrey",
"River" = "blue",
"Pasture" = "lawngreen",
"Residential" = "purple",
"AnnualCrop" = "gold",
"HerbaceousVegetation" = "yellowgreen",
"PermanentCrop" = "chartreuse4"
)
ggplot(conf_df, aes(x = factor(Cluster), y = Freq, fill = Truth)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Cluster Purity Analysis",
x = "Cluster ID",
y = "Proportion of Land Use Types",
fill = "True Label") +
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = land_use_colors)
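The plot can also be summarized as a single number. As a minimal sketch (not part of the original analysis), overall purity counts each image as "correct" if it carries its cluster's most common true label:

# Purity: fraction of images matching their cluster's majority label
purity <- sum(apply(conf_mat, 1, max)) / sum(conf_mat)
purity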
## Dimension Reduction - PCA

It is hard to visualize data with thousands of dimensions. PCA squashes it down to two by finding the directions along which the data varies the most.
The plot below shows the result. The two groups appear as large, heavily overlapping blobs. This is expected: PCA is a linear method, so it can only rotate and project the data, and it cannot unfold cluster structure that lies on a curved surface.
pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)  # data is already centered and scaled
fviz_eig(pca_res, addlabels = TRUE, main = "PCA: Variance Explained")
fviz_pca_ind(pca_res,
geom.ind = "point",
pointsize = 1,
col.ind = as.factor(clara_flex$clustering),
palette = "jco",
addEllipses = TRUE,
legend.title = "Cluster",
title = "PCA Result (Linear Reduction)")
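The scree plot can also be read numerically. As a quick illustration (not in the original analysis), the cumulative share of variance captured by the first two components can be computed directly from the prcomp object:

# Proportion of total variance captured by each principal component
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)
cumsum(var_explained)[1:2]   # cumulative share for PC1 and PC2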
## Dimension Reduction - t-SNE

Since PCA struggled, I tried t-SNE, a non-linear method. Instead of global variance, it preserves the local "neighbors" of each point, which makes it much better at unfolding complex shapes.
The result is very different. You can see a curved shape, almost like a horseshoe. This shows that the land types flow into each other rather than sitting in separate boxes.
# t-SNE requires distinct rows; keep an index so cluster labels stay aligned
dup <- duplicated(df_scaled)
df_unique <- df_scaled[!dup, ]
tsne_out <- Rtsne(df_unique, dims = 2, pca = TRUE, perplexity = 30, verbose = FALSE)
tsne_plot <- data.frame(x = tsne_out$Y[, 1],
                        y = tsne_out$Y[, 2],
                        Cluster = as.factor(clara_flex$clustering[!dup]))
ggplot(tsne_plot, aes(x = x, y = y, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = "t-SNE Visualization (Non-Linear)",
x = "t-SNE Dimension 1",
y = "t-SNE Dimension 2") +
theme_minimal() +
scale_color_brewer(palette = "Set1")
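One way to sanity-check the embedding itself is the Kullback-Leibler divergence that Rtsne records during optimization; lower final values mean the 2-D layout matches the high-dimensional neighborhoods more faithfully. This check is illustrative and was not part of the original pipeline:

# Final KL divergence of the embedding (Rtsne records it every 50 iterations)
tail(tsne_out$itercosts, 1)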
## Comparing the Scores

Finally, I quantified the comparison. I ran CLARA again on the reduced data from PCA (using only the first principal component) and from t-SNE (using both embedding dimensions), then compared each new clustering to the original one using the Rand Index. A higher score means the reduced representation preserved more of the original cluster structure.
As the table below shows, the two scores are nearly identical, with PCA marginally ahead. So while t-SNE was far more informative visually, neither reduction clearly preserved more of the original clustering than the other.
# Cluster the reduced representations and compare them to the original
# clustering with the (unadjusted) Rand index
pca_data_1 <- pca_res$x[, 1, drop = FALSE]   # first principal component only
clara_pca <- eclust(pca_data_1, "clara", k = 2, graph = FALSE)
tsne_data <- data.frame(x = tsne_out$Y[, 1], y = tsne_out$Y[, 2])
clara_tsne <- eclust(tsne_data, "clara", k = 2, graph = FALSE)
randtSne <- randIndex(clara_flex$clustering[!dup], clara_tsne$clustering, correct = FALSE)
randPca <- randIndex(clara_flex$clustering, clara_pca$clustering, correct = FALSE)
comparison_metrics <- data.frame(
  Method = c("t-SNE", "PCA"),
  Score = c(randtSne, randPca)
)
print(comparison_metrics)
##   Method     Score
## 1  t-SNE 0.7811406
## 2    PCA 0.7818914
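A natural follow-up (sketched here, not run in the original analysis) is the adjusted Rand index, which corrects for chance agreement and tends to separate methods more sharply than the raw index:

# Adjusted Rand index: 0 = chance-level agreement, 1 = identical partitions
randIndex(clara_flex$clustering[!dup], clara_tsne$clustering, correct = TRUE)
randIndex(clara_flex$clustering, clara_pca$clustering, correct = TRUE)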