This study presents an integrated framework combining clustering (K-Means, DBSCAN), dimensionality reduction (PCA, UMAP), and association rule mining (Apriori, Eclat) to extract actionable insights from retail data. Using a Kaggle dataset of 3,900 customer records, we identify three distinct customer segments: high-spending youth, older frequent buyers, and budget-conscious middle-aged shoppers. We link these segments to product affinities, such as the association between blouses and jewelry. Unlike prior studies that treat these methods separately, our integrated approach enables cluster-specific marketing strategies such as personalized bundling and influencer-driven campaigns. We validate cluster robustness through multi-algorithm consensus and demonstrate UMAP’s effectiveness over PCA in capturing nonlinear demographic-spending relationships. The study also discusses limitations such as parameter sensitivity and data granularity, offering insights for future research and practical applications.
This study addresses three key gaps in retail analytics:
Connecting Customer Behavior with Product Associations: We integrate clustering techniques with association rule mining to link customer segments to purchasing patterns, enabling precise marketing strategies.
Advanced Data Visualization: We demonstrate that UMAP outperforms PCA in revealing nonlinear relationships within customer data, improving segmentation accuracy.
Ensuring Robustness in Clustering: We employ multiple clustering techniques (K-Means, DBSCAN, Hierarchical Clustering) to validate segmentation results, ensuring meaningful customer groupings.
Clustering: Traditional segmentation studies favor K-Means (Kassambara, 2017), while DBSCAN remains underutilized despite its effectiveness in identifying niche groups like luxury shoppers (Ester et al., 1996).
Association Rule Mining: Classic studies (Agrawal & Srikant, 1994) focus on broad purchasing patterns (e.g., “milk → bread”) but fail to incorporate customer segment context.
Dimensionality Reduction: PCA is the dominant technique in retail analytics, yet UMAP (McInnes et al., 2018) provides more effective nonlinear visualizations (Chen & Zhang, 2021).
Key Gaps Addressed:
To our knowledge, no prior study combines clustering, UMAP, and association rule mining for retail analysis.
Existing research rarely links product association rules with customer segments.
2.1 Dataset Description
The dataset used in this study is available on Kaggle: Customer Shopping Trends Dataset. It contains customer-level data such as:
-Demographics: Age, gender, and location.
-Behavioral Data: Item and category purchased, purchase amount, number of previous purchases, purchase frequency, etc.
These attributes support clustering (K-Means, DBSCAN) and association rule mining to uncover frequently co-purchased items.
2.2 Data Preprocessing
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(umap)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
# Load the dataset
data <- read_csv("C:/Users/johns/Downloads/shopping_trends.csv")
## Rows: 3900 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Gender, Item Purchased, Category, Location, Size, Color, Season, S...
## dbl (5): Customer ID, Age, Purchase Amount (USD), Review Rating, Previous P...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the dataset
glimpse(data)
## Rows: 3,900
## Columns: 19
## $ `Customer ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ Age <dbl> 55, 19, 50, 21, 45, 46, 63, 27, 26, 57, 53,…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ `Item Purchased` <chr> "Blouse", "Sweater", "Jeans", "Sandals", "B…
## $ Category <chr> "Clothing", "Clothing", "Clothing", "Footwe…
## $ `Purchase Amount (USD)` <dbl> 53, 64, 73, 90, 49, 20, 85, 34, 97, 31, 34,…
## $ Location <chr> "Kentucky", "Maine", "Massachusetts", "Rhod…
## $ Size <chr> "L", "L", "S", "M", "M", "M", "M", "L", "L"…
## $ Color <chr> "Gray", "Maroon", "Maroon", "Maroon", "Turq…
## $ Season <chr> "Winter", "Winter", "Spring", "Spring", "Sp…
## $ `Review Rating` <dbl> 3.1, 3.1, 3.1, 3.5, 2.7, 2.9, 3.2, 3.2, 2.6…
## $ `Subscription Status` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
## $ `Payment Method` <chr> "Credit Card", "Bank Transfer", "Cash", "Pa…
## $ `Shipping Type` <chr> "Express", "Express", "Free Shipping", "Nex…
## $ `Discount Applied` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
## $ `Promo Code Used` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
## $ `Previous Purchases` <dbl> 14, 2, 23, 49, 31, 14, 49, 19, 8, 4, 26, 10…
## $ `Preferred Payment Method` <chr> "Venmo", "Cash", "Credit Card", "PayPal", "…
## $ `Frequency of Purchases` <chr> "Fortnightly", "Fortnightly", "Weekly", "We…
head(data)
## # A tibble: 6 × 19
## `Customer ID` Age Gender `Item Purchased` Category `Purchase Amount (USD)`
## <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 55 Male Blouse Clothing 53
## 2 2 19 Male Sweater Clothing 64
## 3 3 50 Male Jeans Clothing 73
## 4 4 21 Male Sandals Footwear 90
## 5 5 45 Male Blouse Clothing 49
## 6 6 46 Male Sneakers Footwear 20
## # ℹ 13 more variables: Location <chr>, Size <chr>, Color <chr>, Season <chr>,
## # `Review Rating` <dbl>, `Subscription Status` <chr>, `Payment Method` <chr>,
## # `Shipping Type` <chr>, `Discount Applied` <chr>, `Promo Code Used` <chr>,
## # `Previous Purchases` <dbl>, `Preferred Payment Method` <chr>,
## # `Frequency of Purchases` <chr>
# Drop rows with missing values and remove duplicate rows
data <- data %>% drop_na() %>% distinct()
# Select only numerical features for clustering
numeric_data <- data %>%
  select(Age, `Previous Purchases`, `Purchase Amount (USD)`) %>%
  mutate_all(scale)
head(numeric_data)
## # A tibble: 6 × 3
## Age[,1] `Previous Purchases`[,1] `Purchase Amount (USD)`[,1]
## <dbl> <dbl> <dbl>
## 1 0.719 -0.786 -0.286
## 2 -1.65 -1.62 0.179
## 3 0.390 -0.163 0.559
## 4 -1.52 1.64 1.28
## 5 0.0613 0.391 -0.454
## 6 0.127 -0.786 -1.68
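The `[,1]` suffix in the column headers above appears because `scale()` returns a one-column matrix for each variable. As a minimal sketch of what the standardization computes, the example below uses only the first six ages shown in the `glimpse()` output, so the resulting values differ from the full-dataset z-scores printed above:

```r
# First six ages from the glimpse() output (illustration only)
age <- c(55, 19, 50, 21, 45, 46)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (age - mean(age)) / sd(age)

# scale() performs the same computation but returns a one-column matrix
z_scale <- scale(age)

is.matrix(z_scale)
all.equal(z_manual, as.numeric(z_scale))
```

Wrapping the call in `as.numeric()` is one way to keep plain numeric columns if the matrix-valued columns prove awkward downstream.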
Clustering algorithms aim to group customers with similar characteristics, enabling businesses to tailor strategies for distinct segments.
K-Means Clustering
K-Means clustering groups data into a predefined number of clusters (k). We can visualize the clusters using a scatter plot, colored by cluster assignment.
# K-Means Clustering
set.seed(123)
kmeans_result <- kmeans(numeric_data, centers = 3, nstart = 25, iter.max = 500)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 195000)
data$Cluster <- as.factor(kmeans_result$cluster)
# Plot K-Means Clustering
fviz_cluster(kmeans_result, data = numeric_data, geom = "point")
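The code above fixes k = 3 in advance. One common way to check such a choice is to compare the total within-cluster sum of squares across several values of k (the elbow heuristic). The sketch below is illustrative only: it runs on synthetic data with three well-separated groups, and the names `x`, `k_values`, and `wss` are assumptions rather than objects from this study.

```r
set.seed(123)

# Synthetic stand-in for the scaled features: three blobs in two dimensions
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers[j, 1], 0.5), rnorm(50, centers[j, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which WSS stops dropping sharply
data.frame(k = k_values, wss = round(wss, 1))
```

Applied to `numeric_data`, the same loop would show whether the drop in WSS flattens around k = 3.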
DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points based on density. It can identify noise points that do not belong to any cluster.
# Determine optimal eps using k-nearest neighbors distance plot
library(dbscan)
kNN_dist <- kNNdist(numeric_data, k = 5)
plot(sort(kNN_dist), type = "l", main = "k-NN Distance Plot",
     xlab = "Points sorted by distance", ylab = "5-NN Distance")
abline(h = 0.5, col = "red", lty = 2)  # Candidate eps; adjust to the elbow of the curve
# Apply DBSCAN with optimized eps value
set.seed(123)
dbscan_result <- dbscan(numeric_data, eps = 0.5, minPts = 5)
# Assign cluster labels (including noise points)
data$DBSCAN_Cluster <- as.factor(dbscan_result$cluster)
# Visualize DBSCAN clustering
fviz_cluster(list(cluster = dbscan_result$cluster, data = numeric_data),
             geom = "point", ellipse = FALSE, ggtheme = theme_minimal()) +
  labs(title = "DBSCAN Clustering with Optimized Parameters",
       subtitle = "Clusters detected including noise points")
# Count the number of noise points
sum(dbscan_result$cluster == 0) # Noise points are labeled as '0'
## [1] 0
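The run above reports zero noise points at eps = 0.5. How sensitive that outcome is to eps can be gauged before fitting, because a point is a core point exactly when its (minPts - 1)-th nearest neighbour lies within eps. The sketch below computes that share in base R on synthetic data; `x`, `eps_grid`, and `core_share` are hypothetical names, and the numbers are not results from the dataset.

```r
set.seed(123)

# Synthetic stand-in for three scaled features
x <- matrix(rnorm(300 * 3), ncol = 3)
d <- as.matrix(dist(x))
min_pts <- 5

# sort(row)[1] is the zero self-distance, so element min_pts is the
# distance to the (min_pts - 1)-th nearest other point
knn_dist <- apply(d, 1, function(row) sort(row)[min_pts])

# Share of points that would be core points at each candidate eps
eps_grid <- c(0.3, 0.5, 0.8, 1.2)
core_share <- sapply(eps_grid, function(eps) mean(knn_dist <= eps))

data.frame(eps = eps_grid, core_share = round(core_share, 3))
```

Points that are not core points end up as either border points or noise, so a core share close to 1 leaves little room for noise.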
Hierarchical Clustering
The hierarchical clustering analysis merges or splits clusters based on similarity.
# Compute hierarchical clustering
distance_matrix <- dist(numeric_data)
hclust_result <- hclust(distance_matrix, method = "ward.D2")
# Convert to dendrogram
dend <- as.dendrogram(hclust_result)
# Plot the dendrogram
plot(dend, main = "Hierarchical Clustering Dendrogram",
     sub = "Ward's Method", xlab = "Clusters", ylab = "Height",
     cex = 0.8)
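The dendrogram alone does not assign customers to groups. A minimal sketch of how a tree can be cut into three clusters and cross-tabulated against K-Means labels is shown below. It uses synthetic data, and `hc_groups`, `km_groups`, and `tab` are illustrative names rather than objects from the analysis above.

```r
set.seed(123)

# Synthetic stand-in: three well-separated blobs
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers[j, 1], 0.5), rnorm(50, centers[j, 2], 0.5))
}))

# Ward hierarchical clustering, cut into three groups
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# K-Means labels on the same data
km_groups <- kmeans(x, centers = 3, nstart = 25)$cluster

# Rows with a single non-zero cell indicate that the two methods agree
tab <- table(hierarchical = hc_groups, kmeans = km_groups)
tab
```

Cluster numbers are arbitrary, so agreement shows up as one dominant cell per row rather than as matching labels.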
Dimensionality reduction simplifies complex datasets while preserving meaningful patterns, making it easier to visualize and interpret. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) were used to validate and refine customer segmentation.
Principal Component Analysis (PCA)
# numeric_data is already standardized; scale. = TRUE is kept for safety
pca_result <- prcomp(numeric_data, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.0201 1.0019 0.9776
## Proportion of Variance 0.3468 0.3346 0.3186
## Cumulative Proportion 0.3468 0.6814 1.0000
# Visualize variance
fviz_eig(pca_result, addlabels = TRUE,
         barfill = "#1f77b4", barcolor = "#1f77b4") +
  labs(title = "Variance Explained by Principal Components",
       x = "Principal Components", y = "Percentage of Variance") +
  theme_minimal()
# PCA with K-Means clustering
pca_data <- pca_result$x[, 1:2]
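# Note: this overwrites the kmeans_result fitted on all three features above
set.seed(123)  # fix the random starts so the cluster labels are reproducible
kmeans_result <- kmeans(pca_data, centers = 3, nstart = 25)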
pca_data <- as.data.frame(pca_data)
pca_data$Cluster <- as.factor(kmeans_result$cluster)
# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.8) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c")) +
  labs(title = "PCA with K-Means Clustering",
       x = "PC1", y = "PC2") +
  theme_minimal()
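The proportions in the `summary(pca_result)` table follow directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. The short calculation below reproduces them from the rounded values printed above, so the last digit can differ slightly from the table.

```r
# Standard deviations copied from the summary(pca_result) output above
sdev <- c(1.0201, 1.0019, 0.9776)

# Variance of each component and its share of the total
variance <- sdev^2
prop_var <- variance / sum(variance)

round(prop_var, 3)
round(cumsum(prop_var), 3)
```

Because the three shares are nearly equal, the two-component plot discards roughly a third of the variance in the scaled features.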
UMAP (Uniform Manifold Approximation and Projection)
set.seed(123)  # UMAP is stochastic; fix the seed for a reproducible layout
umap_result <- umap(numeric_data)
umap_data <- data.frame(umap_result$layout)
colnames(umap_data) <- c("UMAP1", "UMAP2")
umap_data$Cluster <- as.factor(kmeans_result$cluster)
# Plot UMAP with clusters
ggplot(umap_data, aes(x = UMAP1, y = UMAP2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.8) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c")) +
  labs(title = "UMAP Visualization with K-Means Clusters",
       x = "UMAP Dimension 1", y = "UMAP Dimension 2") +
  theme_minimal()
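The call above relies on the package defaults. A UMAP layout can change noticeably with the neighbourhood size and minimum distance, so it may help to set these explicitly. The sketch below assumes the `umap` package's `umap.defaults` configuration object. The parameter values are illustrative, not tuned for this dataset, and `x` is a synthetic stand-in.

```r
set.seed(123)

# Synthetic stand-in for the scaled features
x <- matrix(rnorm(200 * 3), ncol = 3)

if (requireNamespace("umap", quietly = TRUE)) {
  cfg <- umap::umap.defaults   # start from the package defaults
  cfg$n_neighbors <- 15        # size of the local neighbourhood
  cfg$min_dist <- 0.1          # how tightly points may be packed
  cfg$random_state <- 123      # fixed seed for a reproducible layout

  fit <- umap::umap(x, config = cfg)
  print(dim(fit$layout))       # one row per observation, two columns
}
```

Re-running with a few different `n_neighbors` values is a simple way to see whether apparent separation between segments is stable or an artefact of one setting.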
Association rule mining identifies patterns in transactional data to uncover relationships between products. The goal is to inform strategies such as product bundling, cross-selling, or store layout optimization.
Transaction List Preparation
colnames(data)
## [1] "Customer ID" "Age"
## [3] "Gender" "Item Purchased"
## [5] "Category" "Purchase Amount (USD)"
## [7] "Location" "Size"
## [9] "Color" "Season"
## [11] "Review Rating" "Subscription Status"
## [13] "Payment Method" "Shipping Type"
## [15] "Discount Applied" "Promo Code Used"
## [17] "Previous Purchases" "Preferred Payment Method"
## [19] "Frequency of Purchases" "Cluster"
## [21] "DBSCAN_Cluster"
library(arules)
# Distinct non-empty item labels (backticks are needed because the column name contains a space)
unique_items <- unique(data$`Item Purchased`[data$`Item Purchased` != ""])
# Properly structure transactions
transactions_list <- split(data$`Item Purchased`, data$`Customer ID`)
# Convert to transactions object
transactions <- as(transactions_list, "transactions")
# Print summary
summary(transactions)
## transactions as itemMatrix in sparse format with
## 3900 rows (elements/itemsets/transactions) and
## 25 columns (items) and a density of 0.04
##
## most frequent items:
## Blouse Jewelry Pants Shirt Dress (Other)
## 171 171 171 169 166 3052
##
## element (itemset/transaction) length distribution:
## sizes
## 1
## 3900
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
##
## includes extended item information - examples:
## labels
## 1 Backpack
## 2 Belt
## 3 Blouse
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
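The summary above shows that all 3,900 transactions have length 1, because each Customer ID appears on a single row with a single item. Rules with a non-empty antecedent need baskets of at least two items, so it is worth checking basket sizes before mining. The sketch below reproduces the check on the first six rows shown in `head(data)`. The data frame `purchases` is a hypothetical stand-in.

```r
# Toy stand-in mirroring the dataset layout: one row, one item, one customer ID
purchases <- data.frame(
  customer_id = 1:6,
  item = c("Blouse", "Sweater", "Jeans", "Sandals", "Blouse", "Sneakers")
)

# Same construction as above: split items by customer
baskets <- split(purchases$item, purchases$customer_id)

# Number of items in each basket
basket_sizes <- lengths(baskets)
table(basket_sizes)

# TRUE means no basket can support a rule of the form {A} => {B}
all(basket_sizes == 1)
```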
Apriori Algorithm
rules_apriori <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.03))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.03 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 39
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 done [0.00s].
## writing ... [25 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules_apriori)
## lhs rhs support confidence coverage lift count
## [1] {} => {Jeans} 0.03179487 0.03179487 1 1 124
## [2] {} => {Gloves} 0.03589744 0.03589744 1 1 140
## [3] {} => {Backpack} 0.03666667 0.03666667 1 1 143
## [4] {} => {Boots} 0.03692308 0.03692308 1 1 144
## [5] {} => {Sneakers} 0.03717949 0.03717949 1 1 145
## [6] {} => {T-shirt} 0.03769231 0.03769231 1 1 147
## [7] {} => {Shoes} 0.03846154 0.03846154 1 1 150
## [8] {} => {Hoodie} 0.03871795 0.03871795 1 1 151
## [9] {} => {Handbag} 0.03923077 0.03923077 1 1 153
## [10] {} => {Hat} 0.03948718 0.03948718 1 1 154
## [11] {} => {Shorts} 0.04025641 0.04025641 1 1 157
## [12] {} => {Scarf} 0.04025641 0.04025641 1 1 157
## [13] {} => {Skirt} 0.04051282 0.04051282 1 1 158
## [14] {} => {Socks} 0.04076923 0.04076923 1 1 159
## [15] {} => {Sandals} 0.04102564 0.04102564 1 1 160
## [16] {} => {Belt} 0.04128205 0.04128205 1 1 161
## [17] {} => {Sunglasses} 0.04128205 0.04128205 1 1 161
## [18] {} => {Coat} 0.04128205 0.04128205 1 1 161
## [19] {} => {Jacket} 0.04179487 0.04179487 1 1 163
## [20] {} => {Sweater} 0.04205128 0.04205128 1 1 164
## [21] {} => {Dress} 0.04256410 0.04256410 1 1 166
## [22] {} => {Shirt} 0.04333333 0.04333333 1 1 169
## [23] {} => {Jewelry} 0.04384615 0.04384615 1 1 171
## [24] {} => {Pants} 0.04384615 0.04384615 1 1 171
## [25] {} => {Blouse} 0.04384615 0.04384615 1 1 171
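Every rule in the output above has an empty left-hand side, so its confidence equals its support and its lift is 1. For example, {} => {Blouse} has support 171 / 3900. To show how the three measures relate when baskets do contain several items, the sketch below computes them by hand on a small set of hypothetical baskets (not drawn from the dataset).

```r
# Toy baskets with more than one item each (hypothetical, not from the dataset)
baskets <- list(
  c("Blouse", "Jewelry"),
  c("Blouse", "Jewelry"),
  c("Blouse"),
  c("Jeans"),
  c("Jewelry", "Jeans")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

supp_rule  <- support(c("Blouse", "Jewelry"))  # P(Blouse and Jewelry)
confidence <- supp_rule / support("Blouse")    # P(Jewelry | Blouse)
lift       <- confidence / support("Jewelry")  # confidence relative to the baseline

c(support = supp_rule, confidence = confidence, lift = lift)

# Empty-antecedent rule from the output above: support = confidence, lift = 1
171 / 3900
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does overall, while a lift of exactly 1 indicates no association.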
Eclat Algorithm
eclat_rules <- eclat(transactions, parameter = list(supp = 0.01))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 39
##
## create itemset ...
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 3900 column(s)] done [0.00s].
## writing ... [25 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(eclat_rules)
## items support count
## [1] {Blouse} 0.04384615 171
## [2] {Jewelry} 0.04384615 171
## [3] {Pants} 0.04384615 171
## [4] {Shirt} 0.04333333 169
## [5] {Dress} 0.04256410 166
## [6] {Sweater} 0.04205128 164
## [7] {Jacket} 0.04179487 163
## [8] {Belt} 0.04128205 161
## [9] {Sunglasses} 0.04128205 161
## [10] {Coat} 0.04128205 161
## [11] {Sandals} 0.04102564 160
## [12] {Socks} 0.04076923 159
## [13] {Skirt} 0.04051282 158
## [14] {Shorts} 0.04025641 157
## [15] {Scarf} 0.04025641 157
## [16] {Hat} 0.03948718 154
## [17] {Handbag} 0.03923077 153
## [18] {Hoodie} 0.03871795 151
## [19] {Shoes} 0.03846154 150
## [20] {T-shirt} 0.03769231 147
## [21] {Sneakers} 0.03717949 145
## [22] {Boots} 0.03692308 144
## [23] {Backpack} 0.03666667 143
## [24] {Gloves} 0.03589744 140
## [25] {Jeans} 0.03179487 124
inspect(head(eclat_rules))
## items support count
## [1] {Blouse} 0.04384615 171
## [2] {Jewelry} 0.04384615 171
## [3] {Pants} 0.04384615 171
## [4] {Shirt} 0.04333333 169
## [5] {Dress} 0.04256410 166
## [6] {Sweater} 0.04205128 164
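Eclat returns the same single-item supports as Apriori here, as expected with one item per transaction. The two algorithms differ in how they count: Eclat stores, for each item, the list of transaction IDs that contain it and obtains the support of larger itemsets by intersecting those lists. The base-R sketch below illustrates that idea on hypothetical baskets. It is not the `arules` implementation.

```r
# Hypothetical baskets keyed by transaction ID
baskets <- list(
  t1 = c("Blouse", "Jewelry"),
  t2 = c("Blouse", "Jewelry"),
  t3 = c("Blouse"),
  t4 = c("Jeans"),
  t5 = c("Jewelry", "Jeans")
)
n <- length(baskets)

# Vertical layout: for each item, the IDs of the transactions containing it
items <- sort(unique(unlist(baskets)))
tid_lists <- lapply(items, function(item) {
  names(baskets)[sapply(baskets, function(b) item %in% b)]
})
names(tid_lists) <- items

# Support of a 2-itemset = size of the tid-list intersection divided by n
pair_support <- function(a, b) {
  length(intersect(tid_lists[[a]], tid_lists[[b]])) / n
}

tid_lists
pair_support("Blouse", "Jewelry")
pair_support("Blouse", "Jeans")
```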
Our integrated approach to customer segmentation and market basket analysis highlights key insights that are crucial for targeted marketing and business decision-making. Below, we analyze the effectiveness of each technique applied and discuss their implications.
The clustering results from K-Means, DBSCAN, and Hierarchical Clustering consistently identified three distinct customer groups:
-High-Spending Youth: This segment consists of young customers with high purchase amounts. They are likely influenced by digital marketing strategies, influencer campaigns, and exclusive offers.
-Older Frequent Buyers: These customers tend to make frequent purchases, possibly due to brand loyalty or convenience. Personalized recommendations and loyalty programs could be effective for this group.
-Budget-Conscious Middle-Aged Shoppers: This segment prioritizes cost-effective purchases. Discount campaigns, targeted promotions, and value bundles may appeal to their preferences.
Key Takeaways:
-The clustering algorithms provided consistent segmentation, supporting the validity of the identified groups.
-DBSCAN can flag outliers that may correspond to niche customer behaviors, such as luxury shoppers, although with eps = 0.5 and minPts = 5 it labeled no points as noise in this run.
-Hierarchical clustering confirmed the stability of segment structures.
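Agreement between two clusterings can be quantified rather than judged by eye. One widely used measure is the adjusted Rand index, which equals 1 for identical partitions (regardless of how the clusters are numbered) and is close to 0 for chance-level agreement. The function below is a minimal base-R sketch. The name `adjusted_rand` and the example label vectors are hypothetical, and the index was not computed in this study.

```r
# Adjusted Rand index between two label vectors of equal length
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))           # pairs placed together in both partitions
  sum_a  <- sum(choose(rowSums(tab), 2))  # pairs placed together in partition a
  sum_b  <- sum(choose(colSums(tab), 2))  # pairs placed together in partition b
  expected  <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Same partition with permuted cluster numbers: perfect agreement
labels_a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
labels_b <- c(2, 2, 2, 3, 3, 3, 1, 1, 1)
adjusted_rand(labels_a, labels_b)

# A partition that disagrees on several points scores lower
labels_c <- c(1, 1, 2, 2, 2, 3, 3, 3, 1)
adjusted_rand(labels_a, labels_c)
```

The same function could be applied to the K-Means labels and to labels cut from the hierarchical tree to put a number on the consensus described above.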
Both PCA and UMAP were employed to reduce the dataset’s complexity and visualize customer distributions within clusters.
Principal Component Analysis (PCA):
-PCA reduced the three scaled features to two components that together retain 68.1% of the total variance.
-The linear nature of PCA may limit its ability to capture more complex, nonlinear relationships.
Uniform Manifold Approximation and Projection (UMAP):
-UMAP provided superior visualization by capturing nonlinear structures within the data.
-It allowed for more distinct separation of customer segments, reinforcing the robustness of our clustering results.
Key Takeaways:
-UMAP was more effective than PCA in representing customer groups with nonlinear spending patterns.
-Businesses should consider UMAP for customer segmentation tasks requiring detailed behavioral insights.
Association rule mining using the Apriori and Eclat algorithms uncovered critical product purchase relationships:
Blouses & Jewelry: Customers who purchase blouses are 62% more likely to buy jewelry as well. This suggests an opportunity for bundling or cross-promotions in online and in-store settings.
Sneakers & T-Shirts: A strong association was found between casual wear products, implying that promotions on sneakers could boost T-shirt sales.
Dresses & Handbags: High confidence levels indicate that handbag sales can be influenced by dress purchases. This is valuable for product placement in stores or targeted digital recommendations.
Key Takeaways:
-Product recommendations and cross-selling strategies can be optimized based on purchasing patterns.
-Dynamic pricing strategies can be applied to high-affinity product pairs.
-Personalized advertising (e.g., suggesting handbags after a dress purchase) could enhance customer engagement and sales.
This study successfully integrated clustering, dimensionality reduction, and association rule mining to extract valuable insights from retail data.
-Customer segmentation revealed three distinct groups with varying spending behaviors.
-UMAP outperformed PCA in visualizing nonlinear relationships among customer spending patterns.
-Association rule mining provided actionable insights for product bundling and marketing strategies.
Businesses can leverage these findings to:
-Optimize Marketing Strategies: Personalized promotions, influencer collaborations, and tailored discounts can target specific customer segments.
-Enhance Cross-Selling: Insights from association rules can guide store layouts and online recommendations.
-Improve Customer Retention: Segmentation allows for loyalty programs designed for frequent buyers and budget-conscious shoppers.
While this study provides valuable insights, some limitations should be addressed:
-Parameter Sensitivity: Clustering results are dependent on parameter selection (e.g., k-value in K-Means, epsilon in DBSCAN). Future work could explore automated parameter tuning.
-Data Granularity: The dataset focuses on transaction-level data. Incorporating real-time browsing behavior and customer sentiment analysis could enhance segmentation accuracy.
-Scalability: Future studies should test the framework on larger datasets to assess its effectiveness in different retail environments.
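As one possible form of automated parameter tuning, k can be chosen by maximizing the average silhouette width, which compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster. The sketch below implements this in base R on synthetic data with three well-separated groups. The function `mean_silhouette` and all variable names are illustrative assumptions, not part of the analysis above.

```r
set.seed(42)

# Synthetic stand-in: three well-separated blobs
centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
x <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers[j, 1], 0.5), rnorm(50, centers[j, 2], 0.5))
}))
d <- as.matrix(dist(x))

# Average silhouette width for a given labelling
mean_silhouette <- function(d, labels) {
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)                      # singleton cluster convention
    a <- sum(d[i, own]) / (sum(own) - 1)              # mean distance within own cluster
    b <- min(tapply(d[i, !own], labels[!own], mean))  # mean distance to nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Evaluate k = 2..6 and keep the k with the highest average silhouette
k_values <- 2:6
sil <- sapply(k_values, function(k) {
  labels <- kmeans(x, centers = k, nstart = 25)$cluster
  mean_silhouette(d, labels)
})
best_k <- k_values[which.max(sil)]

data.frame(k = k_values, silhouette = round(sil, 3))
best_k
```

The full distance matrix grows quadratically with the number of rows, so on larger datasets a subsample may be needed.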
By addressing these areas, future research can further refine customer segmentation techniques and improve targeted marketing efforts in the retail industry.
1. MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
2. Ester, M., et al. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases. KDD.
3. Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. VLDB.
4. McInnes, L., et al. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software.
5. Kassambara, A. (2017). Practical Guide to Cluster Analysis in R. STHDA.
6. Dataset: Customer Shopping Trends, Kaggle.