# Packages installed
install.packages(c("keras", "tidyverse", "Rtsne", "uwot", "cluster", "factoextra", "ggplot2", "plotly"), dependencies = TRUE)
This project explores dimensionality reduction and clustering techniques using the Fashion MNIST dataset, a collection of grayscale images representing 10 different categories of clothing items. The goal is to understand how to reduce high-dimensional data while preserving important structures and patterns, and then cluster similar images together.
The Fashion MNIST dataset has been made available by the Zalando SE Research team at https://github.com/zalandoresearch/fashion-mnist/tree/master/data/fashion. Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. The dataset serves as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. Here I have used the Fashion-MNIST test dataset for the analysis, which consists of 10,000 examples. Here is an example of how the data looks (each class takes three rows):
In this project we will cover the following:
Principal Component Analysis (PCA) for dimensionality reduction.
t-SNE & UMAP to visualize high-dimensional data in 2D.
K-Means & Hierarchical Clustering to group similar images.
Silhouette Score to evaluate clustering quality.
Interactive visualizations to present the results for better understanding.
library(keras)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Rtsne)
library(uwot)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
fashion_mnist <- read.csv("archive/fashion-mnist_test.csv")
str(fashion_mnist)
## 'data.frame': 10000 obs. of 785 variables:
## $ label : int 0 1 2 2 3 2 8 6 5 0 ...
## $ pixel1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel6 : int 0 0 0 0 0 44 0 0 0 0 ...
## $ pixel7 : int 0 0 14 0 0 105 0 0 0 0 ...
## $ pixel8 : int 9 0 53 0 0 44 0 1 0 0 ...
## $ pixel9 : int 8 0 99 0 0 10 0 0 0 0 ...
## $ pixel10 : int 0 0 17 161 0 0 0 0 0 0 ...
## $ pixel11 : int 0 0 0 212 37 0 0 0 0 92 ...
## $ pixel12 : int 34 209 0 138 0 0 0 108 0 66 ...
## $ pixel13 : int 29 190 0 150 0 0 0 25 0 0 ...
## $ pixel14 : int 7 181 0 169 0 0 0 0 0 0 ...
## $ pixel15 : int 0 150 0 164 0 0 0 0 0 0 ...
## $ pixel16 : int 11 170 0 176 0 0 0 0 0 0 ...
## $ pixel17 : int 24 193 0 202 0 0 0 132 0 81 ...
## $ pixel18 : int 0 180 0 255 17 0 0 54 0 91 ...
## $ pixel19 : int 0 219 12 183 0 0 0 0 0 0 ...
## $ pixel20 : int 3 5 94 26 0 0 0 0 0 0 ...
## $ pixel21 : int 3 0 68 0 0 34 0 0 0 0 ...
## $ pixel22 : int 1 0 14 0 0 68 0 2 0 0 ...
## $ pixel23 : int 0 0 0 0 0 34 0 0 0 0 ...
## $ pixel24 : int 1 0 0 0 0 0 0 1 0 0 ...
## $ pixel25 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ pixel26 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel27 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel28 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel29 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel30 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel31 : int 4 0 0 0 0 0 0 0 0 0 ...
## $ pixel32 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ pixel33 : int 0 0 0 0 0 34 0 0 0 0 ...
## $ pixel34 : int 1 0 38 0 0 136 0 0 0 0 ...
## $ pixel35 : int 0 0 106 129 0 102 0 0 0 0 ...
## $ pixel36 : int 0 0 94 221 0 105 0 0 0 27 ...
## $ pixel37 : int 0 0 89 255 21 98 0 11 0 75 ...
## $ pixel38 : int 0 0 94 255 204 74 0 61 0 111 ...
## $ pixel39 : int 0 24 68 215 235 64 0 102 0 118 ...
## $ pixel40 : int 44 235 2 184 235 34 0 117 0 171 ...
## $ pixel41 : int 88 210 0 127 202 27 0 158 0 172 ...
## $ pixel42 : int 99 241 0 185 166 20 0 87 0 153 ...
## $ pixel43 : int 122 222 0 198 180 13 0 66 0 150 ...
## $ pixel44 : int 123 171 0 213 207 20 0 175 0 176 ...
## $ pixel45 : int 80 220 0 239 217 27 0 109 0 193 ...
## $ pixel46 : int 0 199 140 126 228 47 0 131 0 117 ...
## $ pixel47 : int 0 236 116 237 112 71 0 87 0 129 ...
## $ pixel48 : int 0 27 85 255 0 85 0 43 0 103 ...
## $ pixel49 : int 0 0 109 236 0 85 0 0 0 56 ...
## $ pixel50 : int 1 0 121 173 0 91 0 0 0 0 ...
## $ pixel51 : int 1 0 63 4 0 136 0 1 0 0 ...
## $ pixel52 : int 1 0 0 0 0 27 0 0 16 0 ...
## $ pixel53 : int 0 0 0 0 0 0 0 0 204 0 ...
## $ pixel54 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel55 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel56 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ pixel57 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel58 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel59 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ pixel60 : int 2 0 0 0 0 0 0 0 0 0 ...
## $ pixel61 : int 0 0 0 0 0 102 0 0 0 0 ...
## $ pixel62 : int 0 0 70 59 0 119 0 2 0 30 ...
## $ pixel63 : int 0 0 58 219 0 74 0 0 0 110 ...
## $ pixel64 : int 3 0 58 248 0 91 0 43 0 107 ...
## $ pixel65 : int 46 0 68 231 195 81 0 126 0 115 ...
## $ pixel66 : int 174 0 68 254 246 95 0 124 0 98 ...
## $ pixel67 : int 249 103 101 27 219 95 0 90 0 79 ...
## $ pixel68 : int 67 227 114 3 234 88 0 131 0 95 ...
## $ pixel69 : int 0 217 72 221 245 95 0 173 0 119 ...
## $ pixel70 : int 94 218 189 253 239 88 0 178 0 139 ...
## $ pixel71 : int 210 222 233 242 244 78 0 140 0 153 ...
## $ pixel72 : int 61 189 184 79 246 78 0 123 0 132 ...
## $ pixel73 : int 14 216 82 190 236 81 0 138 19 83 ...
## $ pixel74 : int 212 201 254 123 226 78 0 70 61 75 ...
## $ pixel75 : int 157 215 55 228 236 71 0 98 197 89 ...
## $ pixel76 : int 37 103 80 230 11 68 0 119 181 106 ...
## $ pixel77 : int 0 0 72 238 0 68 0 127 197 106 ...
## $ pixel78 : int 0 0 60 243 0 61 0 8 88 102 ...
## $ pixel79 : int 0 0 92 140 0 68 0 0 37 35 ...
## $ pixel80 : int 0 0 0 0 0 115 0 1 168 0 ...
## $ pixel81 : int 1 0 0 0 0 0 0 0 216 0 ...
## $ pixel82 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel83 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel84 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ pixel85 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel86 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pixel87 : int 2 0 0 1 0 0 0 0 0 0 ...
## $ pixel88 : int 2 0 0 0 0 0 0 0 0 0 ...
## $ pixel89 : int 0 0 0 4 0 153 0 0 0 0 ...
## $ pixel90 : int 23 0 77 150 0 102 0 0 0 117 ...
## $ pixel91 : int 168 0 43 62 0 88 0 0 0 96 ...
## $ pixel92 : int 206 0 72 246 0 91 0 119 0 89 ...
## $ pixel93 : int 242 0 72 225 245 71 0 121 0 91 ...
## $ pixel94 : int 239 0 75 255 226 78 0 105 0 90 ...
## $ pixel95 : int 238 197 63 17 223 71 0 91 1 98 ...
## $ pixel96 : int 214 221 70 0 223 64 0 96 0 95 ...
## $ pixel97 : int 125 201 75 226 222 68 0 58 0 88 ...
## $ pixel98 : int 61 212 121 191 228 68 0 58 132 86 ...
## [list output truncated]
# Extract features and labels
x_train <- as.matrix(fashion_mnist[, -1]) # All columns except the first are features
y_train <- as.factor(fashion_mnist[, 1]) # The first column contains labels
# Normalize pixel values to the range [0, 1]
x_train <- x_train / 255
# Remove duplicate rows to prevent errors in t-SNE, keeping the labels aligned
keep <- !duplicated(x_train)
x_train <- x_train[keep, ]
y_train <- y_train[keep]
PCA is a dimensionality reduction technique that helps in reducing the number of dimensions while retaining as much of the variance as possible. It transforms correlated variables into a set of uncorrelated components, ordered by the amount of variance they explain.
The following plot shows the variance explained by each principal component.
# Drop constant (zero-variance) pixel columns, which cannot be scaled to unit variance
nonzero_var <- apply(x_train, 2, var) > 0
pca_result <- prcomp(x_train[, nonzero_var], center = TRUE, scale. = TRUE)
explained_variance <- summary(pca_result)$importance[2, ]
pca_df <- data.frame(PC = 1:length(explained_variance), Variance = explained_variance)
ggplotly(
  ggplot(pca_df, aes(x = PC, y = Variance)) +
    geom_line() + geom_point() + ggtitle("PCA Variance Explained")
)
This PCA variance explained plot shows the proportion of variance retained by each principal component. The steep drop in variance means that a few components capture most of the variance in the data, making PCA useful for dimensionality reduction.
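As a quick follow-up, here is a minimal sketch (using the explained_variance vector computed above; the 90% threshold is an arbitrary choice) to find how many components are needed to retain most of the variance:
cum_var <- cumsum(explained_variance)  # cumulative proportion of variance
n_pc_90 <- which(cum_var >= 0.90)[1]   # first component count reaching 90%
n_pc_90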
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method used for visualization. It preserves local relationships/similarities in high-dimensional data and projects them into 2D space for easier interpretation.
tsne_result <- Rtsne(x_train, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)
## Performing PCA
## Read the 9999 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 3.80 seconds (sparsity = 0.012302)!
## Learning embedding...
## Iteration 50: error is 97.327339 (50 iterations in 1.62 seconds)
## Iteration 100: error is 83.333730 (50 iterations in 1.76 seconds)
## Iteration 150: error is 79.433312 (50 iterations in 1.50 seconds)
## Iteration 200: error is 78.441535 (50 iterations in 1.45 seconds)
## Iteration 250: error is 78.015568 (50 iterations in 1.43 seconds)
## Iteration 300: error is 2.839026 (50 iterations in 1.33 seconds)
## Iteration 350: error is 2.443910 (50 iterations in 1.20 seconds)
## Iteration 400: error is 2.221058 (50 iterations in 1.21 seconds)
## Iteration 450: error is 2.074211 (50 iterations in 1.27 seconds)
## Iteration 500: error is 1.969357 (50 iterations in 1.23 seconds)
## Fitting performed in 14.01 seconds.
tsne_df <- data.frame(X = tsne_result$Y[,1], Y = tsne_result$Y[,2], Label = y_train)
ggplotly(
ggplot(tsne_df, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.7) + ggtitle("t-SNE Visualization")
)
Clusters in this t-SNE plot represent different categories of clothing. Data points close together belong to similar categories, while distant clusters indicate distinct groups.
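t-SNE is stochastic and sensitive to its perplexity parameter, so it is worth rerunning with a fixed seed and an alternative perplexity as a sanity check. A minimal sketch (the perplexity of 50 here is an illustrative choice, not tuned):
set.seed(42) # t-SNE is stochastic; fix the seed for reproducibility
tsne_alt <- Rtsne(x_train, dims = 2, perplexity = 50, max_iter = 500)
tsne_alt_df <- data.frame(X = tsne_alt$Y[, 1], Y = tsne_alt$Y[, 2], Label = y_train)
ggplot(tsne_alt_df, aes(x = X, y = Y, color = Label)) +
  geom_point(alpha = 0.7) + ggtitle("t-SNE, perplexity = 50")
If the same groups appear under both settings, the cluster structure is more likely a real feature of the data rather than an artifact of the embedding.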
UMAP (Uniform Manifold Approximation and Projection) is another dimensionality reduction technique similar to t-SNE, but it is often faster and better at preserving the global structure of the data.
umap_result <- umap(x_train, n_components = 2)
umap_df <- data.frame(X = umap_result[,1], Y = umap_result[,2], Label = y_train)
ggplotly(
ggplot(umap_df, aes(x = X, y = Y, color = Label)) + geom_point(alpha = 0.7) + ggtitle("UMAP Visualization")
)
Here the clusters in the UMAP plot indicate groups of similar images, in a more compact and better-separated representation than the t-SNE plot.
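UMAP's behaviour is governed mainly by n_neighbors (the trade-off between local and global structure) and min_dist (how tightly points pack within a cluster). A minimal sketch with alternative values (these particular numbers are illustrative assumptions, not tuned):
# Sketch: a larger n_neighbors emphasizes global structure;
# a smaller min_dist packs each cluster more tightly
umap_alt <- umap(x_train, n_components = 2, n_neighbors = 50, min_dist = 0.1)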
K-Means is a popular clustering algorithm that partitions data points into K clusters by minimizing the variance within each cluster.
Here, we apply K-Means clustering to group similar images together based on their pixel values.
The number of clusters is set to 10, corresponding to the 10 clothing categories.
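Before committing to K = 10, an elbow plot is a common sanity check; here is a minimal sketch using fviz_nbclust from factoextra (the 2,000-row subsample and k.max = 15 are arbitrary choices for speed):
# Sketch: within-cluster sum of squares for K = 1..15 on a subsample;
# a bend ("elbow") near K = 10 would support the chosen cluster count
fviz_nbclust(x_train[1:2000, ], kmeans, method = "wss", k.max = 15)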
The silhouette score measures the quality of clusters. It ranges from -1 to 1, where:
| Value | Interpretation        |
|-------|-----------------------|
| 1     | Well-clustered points |
| 0     | Overlapping clusters  |
| -1    | Poor clustering       |
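Concretely, for a single point i the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest neighbouring cluster; the score reported below is the mean of s(i) over all points.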
kmeans_result <- kmeans(x_train, centers = 10, nstart = 25) # 25 random starts for stability
# Average silhouette width over all points (dist() on ~10,000 rows is memory-heavy)
silhouette_score <- silhouette(kmeans_result$cluster, dist(x_train))
mean(silhouette_score[, 3])
## [1] 0.1564944
fviz_cluster(list(data = x_train, cluster = kmeans_result$cluster))
This K-Means plot shows the division of data points into 10 clusters (for 10 clothing categories).
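The mean silhouette of about 0.16 indicates considerable overlap between clusters. One way to probe this further is to cross-tabulate the cluster assignments against the true labels; a minimal sketch using the objects computed above:
# Sketch: rows are K-Means clusters, columns are the true labels (0-9);
# a clean clustering would concentrate each row's mass in one column
table(Cluster = kmeans_result$cluster, Label = y_train)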
Hierarchical clustering builds a tree (dendrogram) of nested clusters, which is useful for understanding hierarchical relationships in data. The dendrogram shows how data points are merged into clusters based on their similarity.
dist_matrix <- dist(x_train[1:1000,]) # Use the first 1,000 rows for speed
hc_result <- hclust(dist_matrix, method = "ward.D2")
plot(hc_result, labels = FALSE, main = "Hierarchical Clustering")
This dendrogram shows how clusters are merged at different levels. Closer points indicate stronger relationships.
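To turn the dendrogram into concrete groups, the tree can be cut at a chosen number of clusters; a minimal sketch cutting at k = 10 to match the clothing categories:
# Sketch: cut the dendrogram into 10 clusters and inspect the group sizes
hc_clusters <- cutree(hc_result, k = 10)
table(hc_clusters)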
This project demonstrated the use of dimensionality reduction and clustering techniques on the Fashion MNIST dataset. Here are some key takeaways:
PCA helps reduce dimensionality while retaining variance.
t-SNE & UMAP are useful for visualizing high-dimensional data in 2D.
K-Means & Hierarchical Clustering effectively grouped images.
Silhouette Score provides a way to evaluate clustering quality.
Interactive visualizations make the analysis more insightful and enhance interpretability.
This project can be further extended by exploring other dimensionality reduction techniques, clustering algorithms, and hyperparameter tuning to improve the clustering results. For now, I hope this project provides a good starting point for understanding these techniques and their applications in real-world datasets. Thank you for reading! 🚀 Ramik Sharma