Cluster Analysis in Market Basket Data

1. Introduction

Market basket analysis is a fundamental tool in retail analytics: it helps businesses understand customer purchasing patterns and optimize product placement, promotions, and inventory management. This project applies clustering techniques to transactional retail data to identify groups of items frequently purchased together. We used K-means clustering and Principal Component Analysis (PCA), with the goal of deriving actionable insights for business optimization.

2. Data Cleaning and Preprocessing

The dataset comprises retail transactions, including invoice numbers, product descriptions, quantities, prices, and customer details. To prepare the data for clustering, the following steps were taken:

Removal of Missing Values: Rows with missing values in key fields such as Description and Invoice were excluded to maintain data integrity.
```
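# Read blank fields as NA so that they count as missing
data <- read.csv("online_retail.csv", na.strings = c("", "NA"))
# Keep only rows where the key fields are present
data <- data[complete.cases(data[, c("Description", "Invoice")]), ]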
```
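One detail worth checking at this step is how blank fields are read. By default, `read.csv` keeps empty text fields as empty strings rather than `NA`, so `complete.cases` does not flag them. The sketch below uses a small inline CSV with made-up values (the column names follow the report, but the rows are not from the actual file) to show the difference.

```r
# Inline CSV with made-up rows: one blank Description and one blank Invoice.
csv_text <- "Invoice,Description,Quantity
A1,mug,2
B2,,1
,candle,3
"

# Default behaviour: blank text fields become "" and are not counted as missing.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
n_missing_default <- sum(is.na(raw$Description))

# Treat empty strings as NA, then keep rows where both key fields are present.
clean <- read.csv(text = csv_text, na.strings = c("", "NA"), stringsAsFactors = FALSE)
kept <- clean[complete.cases(clean[, c("Invoice", "Description")]), ]

print(n_missing_default)
print(kept)
```

Only the first toy row survives the second approach, because the other two lack one of the key fields.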
Filtering Non-Positive Quantities and Prices: Transactions with non-positive quantities (typically returns) and non-positive prices were removed to focus on actual sales.
```
data <- data[data$Quantity > 0 & data$Price > 0, ]
```
Normalization of Descriptions: Product descriptions were converted to lowercase and trimmed of extra spaces for consistency.
```
data$Description <- tolower(trimws(data$Description))
```
Conversion to Transactions: Transactions were grouped by invoice numbers to create a binary transaction matrix, with rows representing invoices and columns representing items.
```
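library(arules)  # provides the "transactions" class
# Drop repeated items within an invoice; the coercion expects unique items per transaction
transactions <- as(lapply(split(data$Description, data$Invoice), unique), "transactions")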
```
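To make the shape of the resulting matrix concrete, here is a minimal base-R sketch of the same grouping idea on made-up data. It does not use the `arules` class; the toy invoices and items are purely illustrative.

```r
# Toy data: column names mirror the report, values are invented.
toy <- data.frame(
  Invoice     = c("A1", "A1", "A1", "B2", "B2", "C3"),
  Description = c("mug", "tea towel", "mug", "mug", "candle", "candle"),
  stringsAsFactors = FALSE
)

# Group item descriptions by invoice and drop repeats within an invoice.
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Build the binary matrix: one row per invoice, one column per item.
items <- sort(unique(unlist(baskets)))
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

print(incidence)
```

Each cell is 1 when the invoice contains the item and 0 otherwise, regardless of how many units were bought.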
Subsampling: To manage computational complexity, a random sample of 5000 transactions was selected.
```
set.seed(123)
sampled_transactions <- sample(transactions, size = 5000)
```
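Frequency Matrix: The sampled transactions were converted into a binary item matrix (one row per transaction, one column per item) suitable for clustering.
```
# Logical matrix: TRUE where the transaction contains the item
item_matrix <- as(sampled_transactions, "matrix")
item_frequency <- as.data.frame(item_matrix)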
```
Removal of Constant/Zero-Variance Columns: Columns with zero variance, such as items that never appear in the sample, were removed.
```
# Use a logical mask: negative indexing with an empty which() result would drop every column
item_variance <- apply(item_frequency, 2, var)
item_frequency <- item_frequency[, item_variance > 0, drop = FALSE]
```
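A point worth checking in this step: when no column is constant, `which()` returns an empty index, and negative indexing with an empty index selects no columns at all. The sketch below, on a made-up two-column data frame, compares that behaviour with a logical mask.

```r
# Made-up data frame with no constant columns.
df <- data.frame(a = c(0, 1, 1), b = c(1, 0, 1))

# Index-based removal: which() finds nothing, so the index is empty.
constant <- which(apply(df, 2, var) == 0)
n_cols_index <- ncol(df[, -constant])

# Mask-based removal: keeps every column whose variance is positive.
keep <- apply(df, 2, var) > 0
n_cols_mask <- ncol(df[, keep, drop = FALSE])

print(c(index_based = n_cols_index, mask_based = n_cols_mask))
```

The index-based version returns zero columns here, while the mask keeps both.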
Removal of Rarely Purchased Items: Items purchased in fewer than 1% of transactions were excluded to reduce noise.
```
# Keep items that appear in at least 1% of the sampled transactions
item_frequency <- item_frequency[, colSums(item_frequency) >= 0.01 * nrow(item_frequency), drop = FALSE]
```
Scaling the Data: Each item column was standardized to zero mean and unit variance so that no single item dominates the distance calculations used in clustering.
```
item_frequency <- scale(item_frequency)
```
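It can help to see what standardization does to binary columns. In the sketch below (two made-up items over 100 baskets), both columns end up with mean 0 and standard deviation 1, but a purchase of the rare item is mapped to a much larger value than a purchase of the common one. Whether that emphasis on rare items is desirable is a modelling choice, not something the report states.

```r
# Two made-up binary item columns over 100 baskets.
rare_item   <- c(rep(1, 2),  rep(0, 98))  # bought in 2% of baskets
common_item <- c(rep(1, 50), rep(0, 50))  # bought in 50% of baskets

z <- scale(cbind(rare_item, common_item))

# After scaling, every column has mean 0 and standard deviation 1.
print(round(colMeans(z), 10))
print(apply(z, 2, sd))

# The value assigned to a purchase is much larger for the rare item.
print(apply(z, 2, max))
```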

Transaction Matrix

3. Dimension Reduction using PCA

To address the high dimensionality of the dataset, Principal Component Analysis (PCA) was applied to reduce the data to its most significant components.

Applying PCA: PCA was conducted on the scaled item frequency matrix to identify the components that explain the most variance.
```
# The matrix is already standardized, so center/scale. are redundant here but harmless
pca_result <- prcomp(item_frequency, center = TRUE, scale. = TRUE)
```
Explained Variance: The variance explained by each principal component was visualized, and the first five components were selected for clustering as they captured the majority of the variance.
```
library(factoextra)  # provides fviz_eig() and the other fviz_* helpers
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))
reduced_data <- as.data.frame(pca_result$x[, 1:5])
```
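The share of variance captured by the retained components can also be checked numerically from the `sdev` element that `prcomp` returns. The following is a minimal sketch on simulated binary data (a stand-in for the item matrix, not the report's data); the 0.8 threshold is an arbitrary example, not a value taken from the analysis.

```r
set.seed(1)
# Simulated stand-in for the item matrix: 200 rows, 10 binary columns.
toy <- matrix(rbinom(200 * 10, 1, 0.3), nrow = 200)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Smallest number of components whose cumulative share reaches the threshold.
n_comp <- which(cum_var >= 0.8)[1]
print(n_comp)
```

Applied to `pca_result`, the same two lines would give the cumulative share covered by the first five components.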

4. Finding the Optimal Number of Clusters

Initial Attempt with k = 4

Initially, the Elbow Method suggested $k = 4$ as a potential number of clusters. However, the resulting silhouette score of 0.09 indicated poor clustering quality.

Elbow Method
```
fviz_nbclust(item_frequency, kmeans, method = "wss", k.max = 10)
```
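For readers who want to see what the elbow plot is built from, the sketch below computes the total within-cluster sum of squares for a range of `k` with base R only. The data are simulated (two well-separated Gaussian groups), so the numbers say nothing about the retail data.

```r
set.seed(123)
# Simulated data: two well-separated groups of 50 points in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Reduction in WSS gained by each additional cluster; the "elbow" is where this flattens.
drop_in_wss <- -diff(wss)
print(round(wss, 1))
print(round(drop_in_wss, 1))
```

On this toy example the largest reduction comes from moving from one cluster to two, after which the gains are small.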

Switching to Silhouette Analysis

To refine the clustering, the Silhouette Method was applied to the PCA-reduced data. This method revealed that $k = 2$ provided the best clustering quality, with an average silhouette score of 0.3.

Silhouette Method
```
fviz_nbclust(reduced_data, kmeans, method = "silhouette", k.max = 10)
```
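The quantity behind this plot is the average silhouette width for each candidate `k`. A minimal sketch of that computation, assuming the `cluster` package is available and using simulated data rather than the PCA scores:

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Simulated data: two well-separated groups of 50 points in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..6 (the silhouette is undefined for k = 1).
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(round(avg_sil, 3))
print(best_k)
```

The `k` with the highest average width is the one the silhouette method recommends.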

5. K-means Clustering with PCA Components

Using the PCA-reduced data, K-means clustering was applied with $k = 2$.

Applying K-means Clustering:
```
set.seed(123)
kmeans_result <- kmeans(reduced_data, centers = 2, nstart = 25)
```

Cluster Visualization: The clusters were visualized using the first two principal components, showing distinct groupings.

K-means Clustering
```
fviz_cluster(kmeans_result, data = reduced_data, geom = "point") +
  labs(title = "K-means Clustering with PCA")
```

Silhouette Plot: The silhouette plot showed an average silhouette score of 0.3, which points to weak-to-moderate separation between the two clusters.

Silhouette Plot
```
library(cluster)  # provides silhouette()
sil <- silhouette(kmeans_result$cluster, dist(reduced_data))
fviz_silhouette(sil) +
  labs(title = "Silhouette Plot for K-means Clustering with PCA")
```
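Beyond the overall average, the silhouette object can be summarised per cluster, and the number of points with negative width gives a rough count of observations that sit closer to the other cluster. The sketch below shows one way to do this on simulated data, assuming the `cluster` package is installed; the object names are illustrative.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Simulated data: two groups of 50 points in two dimensions.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

km  <- kmeans(toy, centers = 2, nstart = 25)
sil <- silhouette(km$cluster, dist(toy))

# Mean silhouette width within each cluster.
per_cluster <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)

# Points with negative width are, on average, nearer to the neighbouring cluster.
n_negative <- sum(sil[, "sil_width"] < 0)

print(round(per_cluster, 3))
print(n_negative)
```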

6. Results and Insights

Cluster Characteristics

Cluster 1:
- Represents staple items frequently purchased together.
- Ideal for promotions and product bundling.
Cluster 2:
- Comprises seasonal or niche items with more variable purchasing patterns.
- Requires targeted marketing and careful inventory management.
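One way such cluster descriptions could be backed up is by comparing item purchase rates across clusters. The sketch below is hypothetical: the item names, the binary matrix, and the cluster labels are all made up. In the report's pipeline the inputs would be an unscaled copy of the item matrix, saved before scaling, and `kmeans_result$cluster`.

```r
set.seed(42)
# Made-up binary basket matrix with hypothetical item names.
toy_items <- matrix(
  rbinom(60 * 4, 1, 0.4), nrow = 60,
  dimnames = list(NULL, c("mug", "candle", "tea_towel", "gift_bag"))
)
# Stand-in for the cluster labels returned by kmeans().
toy_cluster <- sample(1:2, 60, replace = TRUE)

# Share of baskets in each cluster that contain each item.
rates <- aggregate(as.data.frame(toy_items),
                   by = list(cluster = toy_cluster), FUN = mean)

# Ratio of the cluster rate to the overall rate; values above 1 mark over-represented items.
lift <- sweep(as.matrix(rates[, -1]), 2, colMeans(toy_items), "/")
rownames(lift) <- rates$cluster

print(rates)
print(round(lift, 2))
```

Items with a ratio well above 1 in a cluster are the ones that characterise it.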

Business Applications

Store Layout:
- Place Cluster 1 items together for convenience.
- Distribute Cluster 2 items to encourage browsing.
Marketing Strategies:
- Bundle Cluster 1 items to increase average basket size.
- Use personalized campaigns for Cluster 2 items.
Inventory Management:
- Maintain high stock levels for Cluster 1 items.
- Monitor demand trends for Cluster 2 items.

7. Limitations

- The silhouette score of 0.3 suggests some overlap between clusters.
- Subsampling may have excluded rare but significant patterns.

8. Conclusion

This project demonstrated the application of PCA and K-means clustering to uncover purchasing patterns in retail data. The analysis identified two clusters, providing actionable insights for optimizing store layout, marketing strategies, and inventory management. Future work will focus on refining clustering techniques and expanding the analysis to include customer-level data.

Chiedza Chimedza

2025-02-01
