This paper presents:

Three methods of unsupervised learning: clustering, dimensionality reduction, and association rule mining, applied to explore the Online Retail dataset in the business domain.

Dataset Information

This is a transactional data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Data Import & Preprocessing

Libraries Used

library(tidyverse)    # Data manipulation and visualization
library(cluster)      # Clustering algorithms (PAM, CLARA)
library(factoextra)   # Visualization for clustering
library(arules)       # Association rule mining
library(arulesViz)    # Visualizing association rules
library(NbClust)      # Optimal cluster number determination
library(GGally)       # Correlation plots

We load the essential libraries for data manipulation, clustering, association rule mining, dimensionality reduction, and visualization.

Importing the data:

# Download and load the dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
download.file(url, destfile = "Online_Retail.xlsx", mode = "wb")
data <- readxl::read_excel("Online_Retail.xlsx")

# Inspect data structure
str(data)
## tibble [541,909 × 8] (S3: tbl_df/tbl/data.frame)
##  $ InvoiceNo  : chr [1:541909] "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
##  $ Country    : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
head(data)
## # A tibble: 6 × 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANGING HEAR…        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE METAL LANTERN        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPID HEARTS…        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UNION FLAG…        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY HOTTIE …        6 2010-12-01 08:26:00      3.39
## 6 536365    22752     SET 7 BABUSHKA NES…        2 2010-12-01 08:26:00      7.65
## # ℹ 2 more variables: CustomerID <dbl>, Country <chr>

Inspect variables:

- InvoiceNo: Transaction ID
- StockCode: Product ID
- Description: Product name
- Quantity: Number of items purchased
- InvoiceDate: Timestamp
- UnitPrice: Price per item
- CustomerID: Unique customer identifier
- Country: Customer location

Removing missing data

# Remove missing values
# Remove missing CustomerIDs and invalid quantities/prices
data_clean <- data %>%
  filter(!is.na(CustomerID), Quantity > 0, UnitPrice > 0)
dim(data_clean)
## [1] 397884      8
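As a quick sanity check (a one-liner not in the original script), we can count how many rows the filter removed:

# Rows dropped: missing CustomerIDs plus non-positive quantities/prices
nrow(data) - nrow(data_clean)   # 541909 - 397884 = 144025 rows dropped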

Next, we create a customer-level dataset for clustering by grouping transactions by CustomerID.

# Create a customer-level dataset for clustering
customer_data <- data_clean %>%
  group_by(CustomerID) %>%
  summarise(
    TotalAmount = sum(Quantity * UnitPrice),  # Total spending
    Frequency = n(),                          # Number of transactions
    AvgQuantity = mean(Quantity)              # Average items per transaction

  )
dim(customer_data)
## [1] 4338    4

We then scale the data for clustering so that every feature contributes equally to the distance calculations, which avoids bias toward variables with larger ranges and keeps the clusters interpretable. Note in the summary below that the maxima lie more than 30 standard deviations above the mean, signaling extreme outliers that will affect the clustering.

# Scale the data for clustering
scaled_data <- scale(customer_data[, -1])     # Remove CustomerID before scaling
summary(scaled_data)
##   TotalAmount         Frequency         AvgQuantity      
##  Min.   :-0.22811   Min.   :-0.39653   Min.   :-0.03662  
##  1st Qu.:-0.19433   1st Qu.:-0.32660   1st Qu.:-0.03246  
##  Median :-0.15349   Median :-0.22170   Median :-0.02914  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.04367   3rd Qu.: 0.03619   3rd Qu.:-0.02526  
##  Max.   :30.94278   Max.   :33.89766   Max.   :61.63170

Clustering

K-Means Clustering

To determine the optimal number of clusters, we use two common methods: the elbow method and the silhouette method.

*Elbow Method

# Determine the optimal number of clusters using the elbow method
fviz_nbclust(scaled_data, kmeans, method = "wss") + 
  labs(title = "Elbow Method for K-Means") + theme_minimal()

*Silhouette Method

# Silhouette Method
fviz_nbclust(scaled_data, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for K-Means") + theme_minimal()

As we can observe from the graphs, the elbow method suggests 4 clusters while the silhouette method suggests 3, so we choose the number of clusters as k = 4.
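As a further cross-check (not in the original analysis), the NbClust package loaded earlier aggregates about 30 validity indices and reports the k chosen most often; a minimal sketch, noting that the index search can be slow on data with extreme outliers:

# Majority vote over ~30 cluster validity indices for k in 2..8
set.seed(123)
nb <- NbClust(data = scaled_data, distance = "euclidean",
              min.nc = 2, max.nc = 8, method = "kmeans")
table(nb$Best.nc["Number_clusters", ])   # votes per candidate k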

  • K-Means clustering using k = 4
# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 4, nstart = 25)
fviz_cluster(kmeans_result, data = scaled_data, geom = "point") +
  labs(title = "K-Means Clustering (k=4)") + theme_minimal()

Cluster sizes for K-Means (note the highly unbalanced sizes: clusters 3 and 4 contain only a handful of extreme customers):

table(kmeans_result$cluster)
## 
##    1    2    3    4 
## 4015  310    1   12
  • PAM Clustering (Partitioning Around Medoids)
# Perform PAM clustering
set.seed(123)
pam_result <- pam(scaled_data, k = 4)
fviz_cluster(pam_result, data = scaled_data, geom = "point") +
  labs(title = "PAM Clustering (k=4)") + theme_minimal()

Cluster sizes for PAM (noticeably more balanced, since medoids are less sensitive to extreme outliers than means):

table(pam_result$cluster)
## 
##    1    2    3    4 
##  226  597 2309 1206
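One optional way to compare the two partitions is the average silhouette width (higher means better-separated clusters); a sketch using cluster::silhouette:

# Average silhouette width for the K-Means and PAM partitions
d <- dist(scaled_data)
mean(silhouette(kmeans_result$cluster, d)[, "sil_width"])
mean(silhouette(pam_result$cluster, d)[, "sil_width"])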

Summary statistics for the clustered groups, after adding the cluster labels to the original data:

# Add cluster labels to the original customer-level data
customer_data$Cluster <- kmeans_result$cluster
# Summary statistics for each cluster
cluster_stats <- customer_data %>%
  group_by(Cluster) %>%
  summarise(
    MeanAmount = mean(TotalAmount),
    MedianFrequency = median(Frequency),
    AvgQuantity = mean(AvgQuantity)
  )
cluster_stats
## # A tibble: 4 × 4
##   Cluster MeanAmount MedianFrequency AvgQuantity
##     <int>      <dbl>           <dbl>       <dbl>
## 1       1      1067.              36        21.2
## 2       2      9580.             360        27.9
## 3       3     77184.               1     74215  
## 4       4    131726.            1738      2309.

Cluster 1 has low total spending, low frequency, and low quantities purchased (low-value customers).

Cluster 2 has average total spending, average frequency, and low quantities purchased (Average-value customers).

Cluster 3 has high total spending, very low frequency, and massive quantities purchased, which is typical of wholesale buying (wholesale-value customers).

Cluster 4 has very high total spending, high frequency, and average quantities purchased (high-value customers).

Labeling the clusters:

cluster_stats$Label <- c("low-value customers", "Average-value customers", "wholesale-value customers", "high-value customers")
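To carry these labels down to individual customers (an optional extension; customer_labeled is a new name introduced here), the Label column can be joined back by cluster id:

# Attach the human-readable label to each customer via its cluster id
customer_labeled <- customer_data %>%
  left_join(cluster_stats %>% select(Cluster, Label), by = "Cluster")
head(customer_labeled)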

Visualize the clusters in graphs:

ggplot(cluster_stats, aes(x = Cluster, y = MeanAmount, fill = Label)) +
  geom_bar(stat = "identity") +
  labs(title = "Mean Total Amount by Cluster", y = "Mean Amount", x = "Cluster") +
  theme_minimal()

# Visualize the dissimilarity matrix
diss_matrix <- daisy(scaled_data)
fviz_dist(diss_matrix)

  • Hierarchical Clustering
# Perform hierarchical clustering
hclust_result <- hclust(diss_matrix, method = "ward.D2")
plot(hclust_result, cex = 0.6)
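For comparability with K-Means and PAM, the dendrogram can be cut into the same number of groups; a minimal sketch, assuming we keep k = 4:

# Cut the dendrogram into k = 4 groups and inspect the cluster sizes
hclust_clusters <- cutree(hclust_result, k = 4)
table(hclust_clusters)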

Correlation between variables

# Check correlation between variables
ggpairs(customer_data[, -1])

Dimensionality Reduction

*Principal Component Analysis (PCA): linear combinations of the original features that capture the maximum variance in the data. Eigenvalues give the amount of variance explained by each principal component (PC).

The biplot colors the data points based on their cluster assignments from the K-Means clustering result (kmeans_result$cluster).

pca_result <- prcomp(scaled_data, scale. = TRUE)  # data are already scaled, so scale. = TRUE is redundant but harmless
fviz_eig(pca_result, addlabels = TRUE) + 
  labs(title = "PCA - Variance Explained")

# Biplot of PCA
fviz_pca_biplot(pca_result, col.ind = kmeans_result$cluster, 
                label = "var", repel = TRUE) +
  labs(title = "PCA Biplot")

From the graph we can see high correlation between the features. The TotalAmount arrow is the longest, indicating it contributes the most variance to the leading components.
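To quantify what the biplot suggests, the loadings and variance table returned by prcomp can be inspected directly (a small check, not in the original script):

# Variable loadings: larger absolute values mean stronger influence on a PC
pca_result$rotation
# Standard deviation, proportion and cumulative variance explained per PC
summary(pca_result)$importance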

*Multidimensional Scaling (MDS):

Classical Multidimensional Scaling (also known as Principal Coordinates Analysis) returns the coordinates of the observations in a low-dimensional space. Observations that are close to each other in the plot are similar (small dissimilarities), while observations far apart are dissimilar (large dissimilarities). Setting k = 2 requests a 2-dimensional result for easy visualization, and the points are colored by their cluster assignments from the K-Means result (kmeans_result$cluster).

# Compute dissimilarity matrix
diss_matrix <- dist(scaled_data)

# Classical MDS
mds_result <- cmdscale(diss_matrix, k = 2)
plot(mds_result, col = kmeans_result$cluster, pch = 19,
     main = "MDS Plot with K-Means Clusters")
legend("topright", legend = unique(kmeans_result$cluster), 
       col = unique(kmeans_result$cluster), pch = 19, title = "Cluster")

The blue observations lie far apart from the rest and are therefore dissimilar (large dissimilarities), while the black and red observations sit close together and are similar (small dissimilarities).

The plot shows the clusters with their K-Means labels.

Market Basket Analysis (MBA):
  • Manual Categorization:

We assign categories based on keywords detected in the product Description, then mine rules with support = 0.01 (minimum support threshold of 1%) and confidence = 0.5 (minimum confidence threshold of 50%). Products are grouped into 4 categories. Rules like {LIGHTING} => {BAGS/BOTTLES} indicate products often bought together.

# Manually assign categories based on product descriptions
data_clean <- data_clean %>%
  mutate(
    Category = case_when(
      str_detect(Description, "BAG|BOTTLE") ~ "BAGS/BOTTLES",
      str_detect(Description, "PEN|PENCIL") ~ "STATIONERY",
      str_detect(Description, "LIGHT") ~ "LIGHTING",
      TRUE ~ "OTHER"
    )
  )

# Convert to transactions
transactions_manual <- as(split(data_clean$Category, data_clean$InvoiceNo), "transactions")

# Mine rules with support > 0.01 and confidence > 0.5
rules_manual <- apriori(transactions_manual, 
                        parameter = list(support = 0.01, confidence = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 185 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4 item(s), 18532 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [21 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(sort(rules_manual, by = "lift"), 10))
##      lhs                                  rhs            support    confidence
## [1]  {LIGHTING, OTHER, STATIONERY}     => {BAGS/BOTTLES} 0.08504209 0.7752091 
## [2]  {LIGHTING, STATIONERY}            => {BAGS/BOTTLES} 0.08509605 0.7737978 
## [3]  {OTHER, STATIONERY}               => {BAGS/BOTTLES} 0.14321174 0.7281207 
## [4]  {STATIONERY}                      => {BAGS/BOTTLES} 0.14353551 0.7234158 
## [5]  {BAGS/BOTTLES, OTHER, STATIONERY} => {LIGHTING}     0.08504209 0.5938206 
## [6]  {BAGS/BOTTLES, STATIONERY}        => {LIGHTING}     0.08509605 0.5928571 
## [7]  {OTHER, STATIONERY}               => {LIGHTING}     0.10970214 0.5577503 
## [8]  {STATIONERY}                      => {LIGHTING}     0.10997194 0.5542562 
## [9]  {BAGS/BOTTLES, OTHER}             => {LIGHTING}     0.26127779 0.5120017 
## [10] {LIGHTING, OTHER}                 => {BAGS/BOTTLES} 0.26127779 0.6187069 
##      coverage  lift     count
## [1]  0.1097021 1.471944 1576 
## [2]  0.1099719 1.469265 1577 
## [3]  0.1966868 1.382534 2654 
## [4]  0.1984136 1.373601 2660 
## [5]  0.1432117 1.364330 1576 
## [6]  0.1435355 1.362116 1577 
## [7]  0.1966868 1.281457 2033 
## [8]  0.1984136 1.273429 2038 
## [9]  0.5103065 1.176347 4842 
## [10] 0.4222966 1.174782 4842

The top rule shows that customers who buy {LIGHTING, OTHER, STATIONERY} also buy {BAGS/BOTTLES}: the combined itemset appears in about 8.5% of transactions (support = 0.085) and the rule holds with 77.5% confidence. This suggests running a joint promotion on these categories.
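To act on such findings, the rules that conclude in a given category can be pulled out with arules' subset(); a small sketch for BAGS/BOTTLES:

# Keep only rules whose consequent is BAGS/BOTTLES, ranked by confidence
bb_rules <- subset(rules_manual, subset = rhs %in% "BAGS/BOTTLES")
inspect(head(sort(bb_rules, by = "confidence"), 5))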

Support: The proportion of transactions that contain both the antecedent (left-hand side) and consequent (right-hand side) of the rule.

Confidence: The probability that the consequent is purchased given that the antecedent is purchased.

Lift: Indicates how much more likely the consequent is to be purchased when the antecedent is purchased, compared to its general purchase probability.

A lift value of:

- = 1: indicates no association between A and B (they are independent).
- > 1: indicates a positive association (A and B are likely to be purchased together).
- < 1: indicates a negative association (A and B are unlikely to be purchased together).
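As a worked check on rule [1] above: since lift = confidence / support(RHS), the overall support of {BAGS/BOTTLES} can be backed out from the reported numbers:

# lift = confidence / support(RHS)  =>  support(RHS) = confidence / lift
conf_rule1 <- 0.7752091
lift_rule1 <- 1.471944
conf_rule1 / lift_rule1   # ~0.527: BAGS/BOTTLES appears in ~52.7% of baskets

So the rule lifts a baseline of about 52.7% up to 77.5% confidence for baskets containing the antecedent.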

*Automated Categorization:

Using StockCode (unique product IDs) as categories to analyze relationships between specific products rather than broader categories.

Creating a list where each element corresponds to a transaction (invoice) and contains the StockCode values of products purchased in that transaction.

# Use StockCode as categories (unique product IDs)
transactions_auto <- as(split(data_clean$StockCode, data_clean$InvoiceNo), "transactions")

# Mine rules with lower thresholds (sparse data)
rules_auto <- apriori(transactions_auto, 
                      parameter = list(support = 0.005, confidence = 0.3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 92 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3665 item(s), 18532 transaction(s)] done [0.18s].
## sorting and recoding items ... [1246 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.10s].
## writing ... [5491 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(sort(rules_auto, by = "lift"), 10))
##      lhs                                    rhs     support     confidence
## [1]  {23290, 23291}                      => {23289} 0.005180229 0.9696970 
## [2]  {23289, 23292}                      => {23291} 0.005018347 0.9587629 
## [3]  {23291, 23292}                      => {23289} 0.005018347 0.8532110 
## [4]  {23291}                             => {23289} 0.006313404 0.8125000 
## [5]  {23289}                             => {23291} 0.006313404 0.8297872 
## [6]  {23289, 23290}                      => {23291} 0.005180229 0.8135593 
## [7]  {22916, 22917, 22918, 22919, 22920} => {22921} 0.007122815 0.9166667 
## [8]  {23289, 23292}                      => {23290} 0.005018347 0.9587629 
## [9]  {22916, 22917, 22918, 22919}        => {22921} 0.007554500 0.9090909 
## [10] {22916, 22918, 22919, 22920}        => {22921} 0.007230736 0.9054054 
##      coverage    lift     count
## [1]  0.005342111 127.4498  96  
## [2]  0.005234190 123.3875  93  
## [3]  0.005881718 112.1398  93  
## [4]  0.007770343 106.7890 117  
## [5]  0.007608461 106.7890 117  
## [6]  0.006367365 104.7006  96  
## [7]  0.007770343 102.3353 132  
## [8]  0.005234190 102.1138  93  
## [9]  0.008309950 101.4896 140  
## [10] 0.007986186 101.0782 134

Example (rule [1]): the left-hand side {23290, 23291} is the antecedent, i.e. the items purchased together; the right-hand side {23289} is the consequent, i.e. the item likely to be added when the antecedent items are bought.

The support of 0.005180 means the full itemset appears in about 0.52% of all transactions. The confidence of 0.9696970 means that 96.97% of the transactions that contain {23290, 23291} also contain {23289}.
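The arulesViz package loaded at the start can also visualize such rules; a minimal sketch plotting the ten highest-lift rules as an item/rule graph:

# Visualize the top 10 rules by lift as a graph of items and rules
top_rules <- head(sort(rules_auto, by = "lift"), 10)
plot(top_rules, method = "graph")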

Conclusion

Customer Segmentation (Clustering): using K-Means and PAM, customers were grouped into 4 clusters based on their purchasing behavior:

- Cluster 1: low spending, low frequency, low quantities (low-value customers)
- Cluster 2: average spending, average frequency, low quantities (Average-value customers)
- Cluster 3: high spending, very low frequency, massive quantities, typical of wholesale buying (wholesale-value customers)
- Cluster 4: very high spending, high frequency, average quantities (high-value customers)

Business implications can be drawn from these segments in line with the company's plan and strategy, for example targeting promotions at the high-value cluster.

Dimensionality Reduction (PCA) revealed that key features such as TotalAmount and Frequency strongly influence the clustering.

Market Basket Analysis (MBA), through association rule mining, identified product relationships. Manual categorization: rules like {LIGHTING} => {BAGS/BOTTLES} suggest complementary products.

Automated categorization: rules like {23166} => {23165} reveal frequently co-purchased items.

Sources used:

- https://archive.ics.uci.edu/dataset/352/online+retail
- Lectures provided by Katarzyna Kopczewska, Associate Professor, during the classes.