Market basket analysis is a fundamental tool in retail analytics, allowing businesses to understand customer purchasing patterns and optimize product placement, promotions, and inventory management. This project applies clustering techniques to transactional retail data to identify groups of items frequently purchased together. We used K-means clustering and Principal Component Analysis (PCA), with the goal of deriving actionable insights for business optimization.
The dataset comprises retail transactions, including invoice numbers, product descriptions, quantities, prices, and customer details. To prepare the data for clustering, the following steps were taken:
Removal of Missing Values: Rows with missing values in key fields such as Description and Invoice were excluded to maintain data integrity.
data <- read.csv("online_retail.csv")
# Keep only rows that have no missing value in any column
data <- data[complete.cases(data), ]
Filtering Non-Positive Quantities and Prices: Rows with non-positive quantities (negative values indicate returns) or non-positive prices were removed to focus on actual sales.
data <- data[data$Quantity > 0 & data$Price > 0, ]
Normalization of Descriptions: Product descriptions were converted to lowercase and trimmed of extra spaces for consistency.
data$Description <- tolower(trimws(data$Description))
Conversion to Transactions: Transactions were grouped by invoice numbers to create a binary transaction matrix, with rows representing invoices and columns representing items.
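library(arules)  # provides the "transactions" class and its sample() method
# unique() removes repeated items within an invoice, which the coercion does not accept
transactions <- as(lapply(split(data$Description, data$Invoice), unique), "transactions")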
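To illustrate the conversion on a small scale, here is a minimal sketch using hypothetical toy invoices (the `toy` data frame and its contents are invented for illustration). It assumes the arules package is installed. Items are deduplicated within each invoice first, since the coercion to "transactions" is expected to reject a basket that lists the same item twice.

```r
library(arules)

# Hypothetical toy data: three invoices, one of which repeats an item
toy <- data.frame(
  Invoice     = c("A1", "A1", "A1", "B2", "B2", "C3"),
  Description = c("mug", "mug", "tea towel", "mug", "candle", "candle"),
  stringsAsFactors = FALSE
)

# Group items by invoice and drop within-invoice duplicates
baskets <- lapply(split(toy$Description, toy$Invoice), unique)
toy_tx <- as(baskets, "transactions")

# Incidence matrix: one row per invoice, one column per item
toy_matrix <- as(toy_tx, "matrix")
print(toy_matrix)
```

Each cell records only whether an item appears on an invoice, not how many units were bought, which is why the repeated "mug" line on invoice A1 collapses to a single entry.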
Subsampling: To manage computational complexity, a random sample of 5000 transactions was selected.
set.seed(123)
sampled_transactions <- sample(transactions, size = 5000)
Frequency Matrix: The sampled transactions were converted into a binary item matrix (one row per invoice, one column per item) suitable for clustering.
item_matrix <- as(sampled_transactions, "matrix")
item_frequency <- as.data.frame(item_matrix)
Removal of Constant/Zero-Variance Columns: Columns with zero variance, indicating items not purchased in the sample, were removed.
# A logical mask is safe even when no column is constant
varying_columns <- apply(item_frequency, 2, var) > 0
item_frequency <- item_frequency[, varying_columns, drop = FALSE]
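One pitfall with removing columns by negative index in R is worth spelling out: when `which()` finds nothing to remove, it returns an empty vector, and indexing with a negated empty vector selects no columns at all. The sketch below shows the behavior on a hypothetical toy data frame (`toy` is invented for illustration) and contrasts it with a logical mask.

```r
# Hypothetical toy data frame in which every column varies
toy <- data.frame(a = c(0, 1, 1), b = c(1, 0, 1))

# Index-based removal: nothing is constant, so which() returns an empty vector
constant_columns <- which(apply(toy, 2, var) == 0)
lost <- toy[, -constant_columns, drop = FALSE]
ncol(lost)      # every column has been dropped

# Mask-based removal: keep columns whose variance is positive
keep <- apply(toy, 2, var) > 0
filtered <- toy[, keep, drop = FALSE]
ncol(filtered)  # both columns survive
```

The mask form gives the same result as the index form whenever at least one constant column exists, so it can be used as a drop-in replacement.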
Removal of Rarely Purchased Items: Items purchased in fewer than 1% of transactions were excluded to reduce noise.
item_frequency <- item_frequency[, colSums(item_frequency) >= 0.01 * nrow(item_frequency), drop = FALSE]
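Scaling the Data: Each item column was standardized to zero mean and unit variance so that frequently purchased items do not dominate the distance calculations used in clustering.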
item_frequency <- scale(item_frequency)
To address the high dimensionality of the dataset, Principal Component Analysis (PCA) was applied to reduce the data to its most significant components.
Applying PCA: PCA was conducted on the scaled item frequency matrix to identify the components that explain the most variance.
# item_frequency is already standardized, so center and scale. leave it effectively unchanged
pca_result <- prcomp(item_frequency, center = TRUE, scale. = TRUE)
Explained Variance: The variance explained by each principal component was visualized, and the first five components were selected for clustering as they captured the majority of the variance.
library(factoextra)  # provides fviz_eig, fviz_nbclust, fviz_cluster, fviz_silhouette
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))
reduced_data <- as.data.frame(pca_result$x[, 1:5])
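The share of variance retained by a given number of components can also be checked numerically rather than read off the scree plot. The following is a minimal sketch on simulated data: `toy` is a hypothetical stand-in for the item matrix, and the 0.8 threshold is an arbitrary illustration, not a value used in this analysis.

```r
set.seed(123)

# Hypothetical stand-in for the item matrix: 200 baskets x 12 binary items
toy <- matrix(rbinom(200 * 12, size = 1, prob = 0.3), nrow = 200)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var <- cumsum(var_explained)
round(cum_var[1:5], 3)

# Smallest number of components whose cumulative share reaches the threshold
n_components <- which(cum_var >= 0.8)[1]
n_components
```

Applied to `pca_result`, the same two lines of arithmetic would report exactly how much variance the first five components retain.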
Initially, the Elbow Method, applied to the scaled item matrix, suggested \(k = 4\) as a potential number of clusters. However, the resulting average silhouette score of 0.09 indicated poor clustering quality.
fviz_nbclust(item_frequency, kmeans, method = "wss", k.max = 10)
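For readers who want to see what the elbow plot is built from, the quantity on its vertical axis is the total within-cluster sum of squares for each candidate k. Here is a minimal base-R sketch on hypothetical simulated data (`toy` is invented for illustration and is not the retail data).

```r
set.seed(123)

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which further reductions become small
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Because this quantity tends to shrink as k grows, the elbow is a visual judgment rather than a strict optimum, which is one reason to cross-check it against silhouette scores.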
To refine the clustering, the Silhouette Method was applied to the PCA-reduced data. This method revealed that \(k = 2\) provided the best clustering quality, with an average silhouette score of 0.3.
fviz_nbclust(reduced_data, kmeans, method = "silhouette", k.max = 10)
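The silhouette criterion can likewise be computed by hand: fit K-means for each candidate k, compute the silhouette width of every observation, and average. The sketch below assumes the cluster package and uses hypothetical simulated data (`toy` is invented for illustration).

```r
library(cluster)

set.seed(123)

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..6 (silhouette is undefined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the highest average width is the preferred choice
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
best_k
```

Silhouette widths range from -1 to 1, with higher values meaning observations sit closer to their own cluster than to the nearest other one.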
Using the PCA-reduced data, K-means clustering was applied with \(k = 2\).
Applying K-means Clustering:
set.seed(123)
kmeans_result <- kmeans(reduced_data, centers = 2, nstart = 25)
Cluster Visualization: The clusters were visualized using the first two principal components, showing distinct groupings.
fviz_cluster(kmeans_result, data = reduced_data, geom = "point") +
  labs(title = "K-means Clustering with PCA")
Silhouette Plot: The silhouette plot confirmed moderate clustering quality with an average silhouette score of 0.3.
library(cluster)  # provides silhouette()
sil <- silhouette(kmeans_result$cluster, dist(reduced_data))
fviz_silhouette(sil) +
  labs(title = "Silhouette Plot for K-means Clustering with PCA")
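One possible way to turn cluster labels into interpretable purchasing patterns is to compare how often each item appears in each cluster. The sketch below does this on hypothetical toy data: `toy_items` and its item names are invented for illustration, and this step is an assumption about how the clusters might be profiled rather than part of the analysis reported here.

```r
set.seed(123)

# Hypothetical binary incidence matrix: 6 baskets x 3 items
toy_items <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 0,
    0, 0, 1,
    0, 1, 1,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("mug", "tea_towel", "candle"))
)

# Assign each basket to one of two clusters
toy_clusters <- kmeans(toy_items, centers = 2, nstart = 10)$cluster

# Share of baskets in each cluster that contain each item
purchase_rate <- aggregate(
  as.data.frame(toy_items),
  by = list(cluster = toy_clusters),
  FUN = mean
)
print(purchase_rate)
```

Items with a much higher purchase rate in one cluster than the other are the ones that characterize that cluster.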
This project demonstrated the application of PCA and K-means clustering to uncover purchasing patterns in retail data. The analysis identified two distinct clusters, providing actionable insights for optimizing store layout, marketing strategies, and inventory management. Future work will focus on refining clustering techniques and expanding the analysis to include customer-level data.