Introduction

Customer segmentation is the process of breaking up a client base into discrete groups according to shared characteristics. Businesses can improve consumer satisfaction, optimize resource allocation, and customize marketing techniques with this method.

Literature Review

Earlier research has also applied clustering to online retail-related datasets. One example is “Clustering retail stores for inventory transshipment” by Emily C. Griffin, Burcu B. Keskin, and Arthur W. Allaway. The authors present a novel demand correlation-based method for reducing network complexity in inventory and transshipment problems, and report that it significantly increases profit over conventional distance-based methods. They define two transshipment problems, MN-R and MN-P, which are combined with distance-based and demand correlation-based clustering approaches.

In their literature review, they note that Axsäter (1990), Robinson (1990), and Tagaras (1989) were among the first authors to examine a periodic review inventory system with several locations for reactive transshipments. Since then, the issue has been thoroughly examined in numerous settings. A new version of the problem with local decision-making in a two-location situation was presented by Rudi, Kapur, and Pyke (2001). The dynamic multi-location transshipment problem with fixed transshipment costs is NP-Hard, as demonstrated by Herer & Tzur (2003), who also offered an accurate and heuristic approach for solving it. In their study of the multi-location multi-period problem, Herer, Tzur, and Yücesan (2006) showed how to compute the order-up-to levels and then optimize the network flow of the transshipment problem.

Data

The dataset used for this analysis is the Online Retail Dataset.

Link to dataset: https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset

It contains transactional data of a UK-based online retail store from December 2010 to December 2011. The dataset includes the following variables:

• InvoiceNo: Unique identifier for each transaction.

• StockCode: Product code.

• Description: Product description.

• Quantity: Quantity of items purchased.

• InvoiceDate: Date and time of the transaction.

• UnitPrice: Price per unit of the product.

• CustomerID: Unique identifier for each customer.

• Country: Country where the customer resides.

if (!require("readxl")) {install.packages("readxl")}
## Loading required package: readxl
## Warning: package 'readxl' was built under R version 4.4.2
if (!require("dplyr")) {install.packages("dplyr")}
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require("ggplot2")) {install.packages("ggplot2")}
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
library(readxl)
library(dplyr)
library(ggplot2)
cust_segm_data <- read_excel("C://Users/Admin/Desktop/OnlineRetail.xlsx")

summary(cust_segm_data)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##   InvoiceDate                       UnitPrice           CustomerID    
##  Min.   :2010-12-01 08:26:00.00   Min.   :-11062.06   Min.   :12346   
##  1st Qu.:2011-03-28 11:34:00.00   1st Qu.:     1.25   1st Qu.:13953   
##  Median :2011-07-19 17:17:00.00   Median :     2.08   Median :15152   
##  Mean   :2011-07-04 13:34:57.16   Mean   :     4.61   Mean   :15288   
##  3rd Qu.:2011-10-19 11:27:00.00   3rd Qu.:     4.13   3rd Qu.:16791   
##  Max.   :2011-12-09 12:50:00.00   Max.   : 38970.00   Max.   :18287   
##                                                       NA's   :135080  
##    Country         
##  Length:541909     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

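The output above summarizes the dataset. It contains 541,909 rows, of which 135,080 have no CustomerID. Both Quantity and UnitPrice include negative values (minimums of -80995 and -11062.06), so the raw data is not limited to ordinary purchases.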
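The same data-quality checks can be done explicitly rather than read off `summary()`. The sketch below is illustrative only: it uses a small made-up data frame named `toy` (an assumption, not the real dataset) with three of the column names listed above, and counts missing values and non-positive quantities or prices.

```r
# Toy data frame mimicking three of the dataset's columns (values are made up)
toy <- data.frame(
  Quantity   = c(6, -2, 10, 1),
  UnitPrice  = c(2.55, 3.39, 0, 1.25),
  CustomerID = c(17850, NA, 13047, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of rows with a non-positive quantity or unit price
n_suspicious <- sum(toy$Quantity <= 0 | toy$UnitPrice <= 0)

print(na_counts)
print(n_suspicious)
```

Running the same two expressions on `cust_segm_data` would show how many rows are affected before any cleaning decision is made.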

Data Preprocessing

Before clustering, the data must be preprocessed. The CustomerID column contains NA values, so those rows are removed first. Next, spending is computed for each row and aggregated per customer, together with the number of transactions and the average basket value. Finally, outliers in total spending are removed with the IQR method, leaving a dataset that can be passed on to scaling.

# Remove rows with missing CustomerID
data_clean <- subset(cust_segm_data, !is.na(CustomerID))

# Compute spending per row, then aggregate to one row per customer
data_clean <- data_clean %>%
  mutate(TotalSpending = Quantity * UnitPrice) %>%
  group_by(CustomerID) %>%
  summarize(
    TotalSpending = sum(TotalSpending, na.rm = TRUE),
    # n() counts rows (invoice lines) per customer, not distinct invoices
    Transactions = n(),
    AverageBasketValue = TotalSpending / Transactions
  )

# Remove outliers using IQR method
quantiles <- data_clean %>%
  summarize(
    Q1 = quantile(TotalSpending, 0.25),
    Q3 = quantile(TotalSpending, 0.75),
    IQR = Q3 - Q1
  )

lower_bound <- quantiles$Q1 - 1.5 * quantiles$IQR
upper_bound <- quantiles$Q3 + 1.5 * quantiles$IQR

data_clean <- data_clean %>%
  filter(
    TotalSpending >= lower_bound & TotalSpending <= upper_bound
  )
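To make the IQR rule used above concrete, here is a minimal sketch on a made-up vector of spending values (the vector `spend` and its numbers are assumptions for illustration). Values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are dropped.

```r
# Made-up spending values; 5000 is an obvious outlier
spend <- c(120, 150, 180, 200, 210, 250, 5000)

# Quartiles and interquartile range
q1  <- quantile(spend, 0.25, names = FALSE)
q3  <- quantile(spend, 0.75, names = FALSE)
iqr <- q3 - q1

# Bounds of the IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Keep only the values inside the bounds
kept <- spend[spend >= lower & spend <= upper]
print(kept)
```

The rule is applied to TotalSpending only, so customers with extreme values in the other features are kept as long as their total spending is within the bounds.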

Scaling

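Scaling, or normalization, is the process of adjusting the values of numeric features so that they are on a similar scale. This is important in many machine learning algorithms, especially those based on distance (e.g., K-means clustering, K-nearest neighbors), because otherwise the feature with the largest numeric range dominates the distance. Here, R's scale() function standardizes each column to a mean of 0 and a standard deviation of 1.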

data_scaled <- scale(data_clean %>% select(-CustomerID))
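As a quick check of what `scale()` does, the sketch below standardizes a small made-up data frame (the name `features` and its values are assumptions) and verifies that every column ends up with mean 0 and standard deviation 1.

```r
# Made-up features on very different scales
features <- data.frame(
  TotalSpending = c(100, 500, 900, 1300),
  Transactions  = c(2, 4, 6, 8)
)

# Standardize each column: subtract the mean, divide by the standard deviation
scaled <- scale(features)

# Column means should be (numerically) 0 and standard deviations 1
col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

print(round(col_means, 10))
print(col_sds)
```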

Clustering (K-means)

Clustering is a technique used to group similar data points together, identifying patterns or structures within the data. The algorithm assigns each data point to one of the groups (or clusters) based on similarities or distances between data points. In this project, the K-means technique is used, and the elbow method is used to determine the optimal number of clusters.

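set.seed(42) # Make the random starts reproducible
# Total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(data_scaled, centers = k, nstart = 10)$tot.withinss
})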

# Plot the Elbow Method
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters",
     ylab = "Total Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method for Optimal Clusters")

This graph shows that adding clusters beyond 4 brings little further benefit.

Why 4 Clusters?

When the Elbow Method is applied to the dataset, WSS is computed for k = 1 to 10 clusters. The plot shows:

* A steep decrease in WSS from k = 1 to k = 4.

* After k = 4, the reduction in WSS becomes much smaller, indicating that the additional clusters do not significantly improve the grouping.

Thus, 4 clusters strike a balance between two goals: minimizing WSS, so that data points are tightly grouped within clusters, and avoiding overfitting, so that the data is not split into many small, meaningless clusters.
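The elbow can also be read numerically, by looking at the relative drop in WSS for each added cluster. The sketch below is a hedged illustration on synthetic data, not on the retail dataset: it generates three well-separated groups (an assumption chosen so that the elbow is obvious), computes WSS for k = 1 to 6, and prints the fraction of WSS removed at each step.

```r
set.seed(1)

# Synthetic 2-D data with three well-separated groups (illustrative only)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, mean = centers[i, 1]), rnorm(50, mean = centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1 to 6
wss_toy <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Fraction of WSS removed when going from k to k + 1 clusters
rel_drop <- -diff(wss_toy) / wss_toy[-length(wss_toy)]
print(round(rel_drop, 2))
```

On this synthetic data, the drop falls sharply after the third cluster. Applying the same `rel_drop` computation to the `wss` vector from the retail data gives a numeric counterpart to the visual elbow.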

set.seed(42) # For reproducibility
optimal_clusters <- 4 # Replace with optimal number from the Elbow method
kmeans_result <- kmeans(data_scaled, centers = optimal_clusters, nstart = 10)

# Add cluster assignment to the original data
data_clean$Cluster <- as.factor(kmeans_result$cluster)

Plotting Clusters

ggplot(data_clean, aes(x = TotalSpending, y = AverageBasketValue, color = Cluster)) +
  geom_point(alpha = 0.6, size = 3) +
  labs(
    title = "Customer Segmentation Clusters",
    x = "Total Spending",
    y = "Average Basket Value",
    color = "Cluster"
  ) +
  theme_minimal()

From the graph, it can be observed that 4 clusters make the dataset easier to visualize. With more clusters than necessary, the points would be split into smaller groups that no longer correspond to meaningful differences between customers. That is why 4 clusters are enough.

Results

Customers are grouped into 4 clusters, with each cluster representing a distinct customer segment. Each customer is assigned a cluster label, and the following cluster-level insights are derived:

Cluster 1: Low spenders with moderate transaction frequency.

Cluster 2: High spenders with frequent transactions.

Cluster 3: Very low spenders with minimal transactions.

Cluster 4: Elite customers with very high spending and frequent transactions.
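Descriptions like these are usually backed by per-cluster averages. The sketch below shows one way to compute such a profile with base R's `aggregate()`. It runs on a small made-up data frame (`toy`, with invented values and only two clusters), so the numbers say nothing about the real segments.

```r
# Made-up customers with a cluster label (illustrative values only)
toy <- data.frame(
  TotalSpending = c(100, 120, 900, 950, 80),
  Transactions  = c(5, 7, 40, 50, 3),
  Cluster       = factor(c(1, 1, 2, 2, 1))
)

# Mean of each feature per cluster
profile <- aggregate(cbind(TotalSpending, Transactions) ~ Cluster,
                     data = toy, FUN = mean)

# Number of customers per cluster, in the same order as the factor levels
profile$Customers <- as.vector(table(toy$Cluster))

print(profile)
```

Because K-means numbers its clusters arbitrarily, a table like this is the safest way to match each cluster number to a description after every run.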

tail(data_clean[2:5])
## # A tibble: 6 × 4
##   TotalSpending Transactions AverageBasketValue Cluster
##           <dbl>        <int>              <dbl> <fct>  
## 1         174.             9              19.3  3      
## 2         181.            10              18.1  3      
## 3          80.8            7              11.5  3      
## 4         177.            13              13.6  3      
## 5        2095.           756               2.77 1      
## 6        1837.            70              26.2  2

This shows the last 6 rows of the dataset with their assigned clusters.

Conclusion

This project was conducted to demonstrate a real application of clustering, which gives the dataset more structure and makes it easier to visualize.