1 Dataset Used

The dataset used in this project is the UCI Online Retail Dataset, which contains transactional data collected from a UK-based online retail store.

Each row in the dataset represents a single product purchase made by a customer. The dataset includes information such as:
- Invoice number
- Product description
- Quantity purchased
- Invoice date
- Unit price
- Customer ID
- Country

Since the dataset is transactional in nature, the same customer may appear multiple times across different invoices. Therefore, the data must be transformed into a customer-level format before applying machine learning techniques.
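
As an illustration of this transformation, the sketch below aggregates a handful of made-up transactions into one row per customer. The column names (InvoiceNo, CustomerID, Quantity, UnitPrice) follow the public UCI file and are an assumption here, as are all values; the project's actual preprocessing may differ.

```r
# Toy transactions with UCI-style column names (all values are made up)
transactions <- data.frame(
  InvoiceNo  = c("A1", "A1", "A2", "B1", "B1", "C1"),
  CustomerID = c(101, 101, 101, 102, 102, 103),
  Quantity   = c(2, 1, 5, 10, 3, 1),
  UnitPrice  = c(3.50, 10.00, 2.00, 1.25, 4.00, 20.00)
)

# Revenue of each line item
transactions$LineTotal <- transactions$Quantity * transactions$UnitPrice

# One row per customer: summed spend and summed quantity
customer_level <- aggregate(
  cbind(LineTotal, Quantity) ~ CustomerID,
  data = transactions,
  FUN = sum
)
names(customer_level) <- c("CustomerID", "TotalSpend", "TotalQuantity")

# Frequency = number of distinct invoices per customer
invoice_counts <- aggregate(
  InvoiceNo ~ CustomerID,
  data = transactions,
  FUN = function(x) length(unique(x))
)
names(invoice_counts)[2] <- "Frequency"

customer_level <- merge(customer_level, invoice_counts, by = "CustomerID")
print(customer_level)
```

Each customer now appears exactly once, which is the shape a clustering algorithm expects.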

2 Machine Learning Task

The objective of this project is Customer Segmentation.

Customer segmentation is an unsupervised learning problem because the dataset does not contain predefined labels such as “high-value” or “low-value” customers. Instead, the goal is to allow the algorithm to automatically discover patterns in customer purchasing behavior.

Based on these patterns, customers are grouped into meaningful segments that can later be interpreted from a business perspective.

3 Possible Algorithms

The following clustering algorithms were considered:

4 Algorithm Chosen and Justification

K-Means Clustering was selected as the primary algorithm for this project.

The reasons for choosing K-Means are:
- It is computationally efficient and works well with large datasets
- It produces compact clusters, each summarized by a centroid, which makes the segments easy to interpret
- It is widely used in real-world customer segmentation problems

Given the nature of the Online Retail dataset and the objective of grouping customers based on purchasing behavior, K-Means is a suitable and practical choice.

5 Mathematical Explanation of K-Means Clustering

K-Means clustering is an unsupervised learning algorithm that partitions a dataset into K distinct clusters based on similarity between data points.

Each cluster is represented by a centroid, which is the mean position of all data points assigned to that cluster. The objective of K-Means is to group data points such that observations within the same cluster are as similar as possible, while observations in different clusters are as dissimilar as possible.

Mathematically, K-Means attempts to minimize the Within-Cluster Sum of Squares (WCSS), also known as cluster variance. This objective function is defined as:

\[ \underset{C_1, C_2, \dots, C_K}{\arg\min} \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2 \]

where:
- \(K\) represents the number of clusters
- \(C_k\) denotes the set of data points belonging to the k-th cluster
- \(x_i\) represents an individual data point
- \(\mu_k\) is the centroid (mean) of cluster \(C_k\)
- \(\| x_i - \mu_k \|^2\) represents the squared Euclidean distance between a data point and its cluster centroid
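
To make the objective concrete, the following sketch evaluates the WCSS by hand for four made-up two-dimensional points that have already been split into two clusters. The points and the assignment are illustrative assumptions, not project data.

```r
# Four 2-D points (made-up values) and a fixed cluster assignment
x <- matrix(c(1, 1,
              1, 3,
              5, 5,
              7, 5), ncol = 2, byrow = TRUE)
cluster <- c(1, 1, 2, 2)

# Centroid of each cluster: per-cluster column sums divided by cluster size
centroids <- rowsum(x, cluster) / as.vector(table(cluster))

# Squared Euclidean distance from each point to its own centroid
sq_dist <- rowSums((x - centroids[cluster, , drop = FALSE])^2)

# WCSS is the sum of those squared distances over all points
wcss <- sum(sq_dist)
wcss
```

Here the centroids are (1, 2) and (6, 5), every point lies at squared distance 1 from its centroid, and the WCSS is therefore 4.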

The K-Means algorithm follows an iterative optimization process consisting of the following steps:

  1. Initialization:
    K centroids are initialized randomly from the dataset.

  2. Assignment Step:
    Each data point is assigned to the nearest centroid based on Euclidean distance.

  3. Update Step:
    New centroids are computed by taking the mean of all data points assigned to each cluster.

  4. Convergence:
    Steps 2 and 3 are repeated until cluster assignments no longer change or the reduction in WCSS becomes negligible.
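
The four steps above can be sketched directly in base R. The function `simple_kmeans` is a hypothetical teaching helper written for this illustration; it is not the routine used later in the project, and the toy data are simulated.

```r
# Hypothetical helper that follows the four steps literally
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. Initialization: pick k distinct rows as starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # 2. Assignment: squared distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-d, ties.method = "first")

    # 4. Convergence: stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # 3. Update: each centroid becomes the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }

  wcss <- sum((x - centroids[assignment, , drop = FALSE])^2)
  list(cluster = assignment, centers = centroids, wcss = wcss)
}

set.seed(1)
# Two simulated, well-separated groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Note that R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default, which minimizes the same objective with a different update rule, so its intermediate iterations will not match this sketch.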

In this project, each customer is represented in a three-dimensional feature space using the following features:
- Total Spend
- Total Quantity
- Frequency of purchases

K-Means clustering groups customers by assigning each customer to the nearest centroid in this feature space, so that customers with similar purchasing behavior end up in the same cluster. With K = 3, the resulting clusters are interpreted as low-value, medium-value, and high-value customers.

This formulation favors clusters whose members exhibit similar purchasing patterns, while customers in different clusters tend to behave differently.

6 Loading Required Libraries

The following R libraries are used in this project:
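
A minimal sketch of the loading step, assuming the four packages named in the interpretation below are the complete set. It loads each package only if it is installed and reports any that are missing.

```r
# Packages named in this section; assumed to be the full list
pkgs <- c("dplyr", "ggplot2", "cluster", "factoextra")

# Check which packages are not installed, without attaching anything yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Missing packages can be installed with `install.packages()` before rerunning the chunk.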

Interpretation:
These libraries are used for data manipulation (dplyr), visualization (ggplot2), clustering (cluster), and cluster visualization (factoextra).

7 Loading the Dataset

# Load the dataset
online_retail <- read.csv("data/processed/cleaned_online_retail.csv")


# View structure
str(online_retail)
## 'data.frame':    4338 obs. of  10 variables:
##  $ CustomerID    : int  12346 12347 12348 12349 12350 12352 12353 12354 12355 12356 ...
##  $ TotalSpend    : num  11.25 8.37 7.49 7.47 5.82 ...
##  $ TotalQuantity : int  74215 2458 2341 631 197 536 20 530 240 1591 ...
##  $ Frequency     : int  1 7 4 1 1 8 1 1 1 3 ...
##  $ Recency       : int  28090141 161881 6478621 1565941 26772541 3103981 17607781 20043541 18486061 1915801 ...
##  $ M_Score       : int  3 3 3 3 1 3 1 2 2 3 ...
##  $ F_Score       : int  1 3 2 1 1 3 1 1 1 2 ...
##  $ R_Score       : int  1 3 2 3 1 2 1 1 1 3 ...
##  $ RFM_Score     : int  5 9 7 7 3 8 3 4 4 8 ...
##  $ Customer_Label: chr  "Medium" "High" "High" "High" ...
# View first few rows
head(online_retail)
##   CustomerID TotalSpend TotalQuantity Frequency  Recency M_Score F_Score
## 1      12346  11.253955         74215         1 28090141       3       1
## 2      12347   8.368925          2458         7   161881       3       3
## 3      12348   7.494564          2341         4  6478621       3       2
## 4      12349   7.472245           631         1  1565941       3       1
## 5      12350   5.815324           197         1 26772541       1       1
## 6      12352   7.826858           536         8  3103981       3       3
##   R_Score RFM_Score Customer_Label
## 1       1         5         Medium
## 2       3         9           High
## 3       2         7           High
## 4       3         7           High
## 5       1         3            Low
## 6       2         8           High

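What this code does:
This code loads the preprocessed Online Retail dataset into R and displays its structure and first six records.

Interpretation:
The file is already aggregated to the customer level: it contains 4,338 customers (one row each) and 10 variables, including the three clustering features (TotalSpend, TotalQuantity, Frequency), Recency, the RFM scores, and a Customer_Label column.
TotalSpend lies between roughly 5.8 and 11.3 in the sample rows, which suggests it was stored on a logarithmic scale during preprocessing rather than as a raw monetary amount.
Basic inspection confirms that the data is properly loaded and ready for further analysis.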

8 Exploratory Data Analysis (EDA)

Before applying clustering, it is important to understand the dataset and ensure its quality.

One of the first checks performed is to identify missing values. Clustering algorithms such as K-Means cannot handle missing data, so this step ensures that the dataset is suitable for modeling.

8.1 Dataset Summary

What this step does:
This step counts the missing values in every column of the dataset, so that any gaps can be handled before clustering.

colSums(is.na(online_retail))
##     CustomerID     TotalSpend  TotalQuantity      Frequency        Recency 
##              0              0              0              0              0 
##        M_Score        F_Score        R_Score      RFM_Score Customer_Label 
##              0              0              0              0              0

Interpretation:
All ten columns contain zero missing values.
Since clustering algorithms cannot handle missing data, this confirms that no imputation or row removal is needed before modeling.
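
A statistical summary of the numeric features complements the missing-value check. The sketch below is self-contained: it re-enters the six rows printed by `head(online_retail)` earlier rather than reading the file, so the statistics describe only that sample. On the full data the same calls would be applied to `online_retail` directly.

```r
# The six rows shown by head(online_retail), re-entered by hand
sample_rows <- data.frame(
  TotalSpend    = c(11.253955, 8.368925, 7.494564, 7.472245, 5.815324, 7.826858),
  TotalQuantity = c(74215, 2458, 2341, 631, 197, 536),
  Frequency     = c(1, 7, 4, 1, 1, 8)
)

# Minimum, quartiles, mean, and maximum of each feature
summary(sample_rows)

# Standard deviation and range of each feature, to compare scales
spreads <- sapply(sample_rows, sd)
ranges  <- sapply(sample_rows, function(col) diff(range(col)))
ranges
```

Even in this small sample the range of TotalQuantity is several orders of magnitude larger than that of Frequency, which is the kind of scale difference that matters for a distance-based method.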

9 Feature Distribution Analysis

The distribution of customer spending is visualized using a histogram.

This visualization helps identify spending patterns among customers. The distribution is right-skewed, meaning that:
- Most customers have low spending
- A small number of customers contribute very high revenue

This observation highlights the presence of high-value customers. Because the features also differ widely in numeric range (TotalQuantity reaches the tens of thousands while Frequency is mostly in single digits), feature scaling is needed before clustering.

# Feature Distribution – Total Spend

# What this step does:
# Visualizes how customer spending is distributed across the dataset

hist(
  online_retail$TotalSpend,
  main = "Distribution of Total Spend",
  xlab = "Total Spend",
  col = "skyblue",
  border = "white"
)

Interpretation:
The histogram shows a right-skewed distribution of customer spending, where most customers have low total spend while a small number of customers contribute significantly higher revenue.
This indicates the presence of high-value customers. Scaling does not remove the skew; it is applied before K-Means so that features measured on very different numeric ranges contribute comparably to the distance calculation.
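
Skew can also be quantified numerically as a complement to the histogram. The sketch below uses a moment-based skewness formula on simulated spending values; the helper `skewness` and the simulated data are assumptions for illustration, not part of the project code.

```r
# Moment-based sample skewness (hypothetical helper)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated right-skewed spending values, a stand-in for raw monetary totals
raw_spend <- rlnorm(1000, meanlog = 6, sdlog = 1)

skew_raw <- skewness(raw_spend)       # clearly positive for right-skewed data
skew_log <- skewness(log(raw_spend))  # close to zero after a log transform

c(raw = skew_raw, log = skew_log)
```

A value near zero indicates a roughly symmetric distribution, while a large positive value indicates a long right tail.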

10 Feature Engineering and Scaling

For customer segmentation, three key behavioral features are selected:
- Total Spend
- Total Quantity
- Frequency of purchases

These features collectively represent how valuable, active, and engaged a customer is.

Since K-Means is a distance-based algorithm, all features are scaled using standardization. This ensures that no single feature dominates the clustering process due to larger numeric values.

10.1 What this step does:

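This step selects the three customer-level features (TotalSpend, TotalQuantity, Frequency) and standardizes them with scale(), which subtracts each column's mean and divides by its standard deviation. These features are commonly used for customer segmentation in retail analytics.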

# Select numeric features
features <- online_retail[, c("TotalSpend", "TotalQuantity", "Frequency")]

# Scale features
scaled_features <- scale(features)

# View summary
summary(scaled_features)
##    TotalSpend       TotalQuantity        Frequency       
##  Min.   :-4.00411   Min.   :-0.23588   Min.   :-0.42505  
##  1st Qu.:-0.68559   1st Qu.:-0.20437   1st Qu.:-0.42505  
##  Median :-0.06218   Median :-0.16097   Median :-0.29514  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.65411   3rd Qu.:-0.03935   3rd Qu.: 0.09457  
##  Max.   : 4.73104   Max.   :38.78727   Max.   :26.59496

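Interpretation:
After scaling, every feature has a mean of zero and unit variance, so all features are expressed in the same unit (standard deviations).
The ranges still differ: TotalQuantity and Frequency have maxima of about 38.8 and 26.6 standard deviations above the mean, which points to a small number of extreme customers that can strongly influence the centroids.
Standardization nevertheless ensures that K-Means clustering is not biased toward variables simply because they have larger raw numeric values.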
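
The standardization performed by `scale()` can be verified by hand. The sketch below uses the six Frequency values printed by `head(online_retail)` earlier and checks that the manual z-scores match the function's output.

```r
# Frequency values from the first six rows of the dataset
x <- c(1, 7, 4, 1, 1, 8)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# Same computation via scale(), flattened from a one-column matrix
scaled  <- scale(x)
z_scale <- as.vector(scaled)

all.equal(z_manual, z_scale)

# scale() stores the parameters it used, which allows new data
# to be standardized consistently later
center_used <- attr(scaled, "scaled:center")
scale_used  <- attr(scaled, "scaled:scale")
```

The stored `scaled:center` and `scaled:scale` attributes are the column mean and standard deviation, so a new customer can be placed in the same feature space with `(new_value - center_used) / scale_used`.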

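# Elbow Method to determine optimal k
set.seed(123)       # make the random starts reproducible
wss <- numeric(10)  # total within-cluster sum of squares for k = 1..10

for (k in 1:10) {
  # nstart = 25 keeps the best of 25 random initializations for each k
  kmeans_model <- kmeans(scaled_features, centers = k, nstart = 25)
  wss[k] <- kmeans_model$tot.withinss
}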

plot(
  1:10, wss,
  type = "b",
  pch = 19,
  frame = FALSE,
  xlab = "Number of Clusters (k)",
  ylab = "Total Within-Cluster Sum of Squares",
  main = "Elbow Method for Optimal k"
)

Determining the Optimal Number of Clusters

The Elbow Method is used to determine the optimal number of clusters (k). It measures how the total within-cluster variance changes as the number of clusters increases.

The “elbow point” represents a balance between model simplicity and clustering performance.
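
Reading the elbow from a plot is subjective, so it can help to look at the percentage reduction in WSS at each step. The sketch below does this on simulated data with three separated groups; the data are a stand-in for `scaled_features` and an assumption of this example.

```r
set.seed(123)
# Synthetic stand-in for scaled_features: three separated groups, 3 dimensions
toy <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 5), ncol = 3),
  matrix(rnorm(150, mean = 10), ncol = 3)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percentage reduction in WSS when moving from k to k + 1
pct_drop <- -diff(wss) / head(wss, -1) * 100
names(pct_drop) <- paste0(1:5, "->", 2:6)
round(pct_drop, 1)
```

The elbow corresponds to the last k before the reductions become small; in this simulated case the drop from 3 to 4 clusters is much smaller than the drop from 2 to 3.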

10.2 Interpretation

The elbow point in the graph suggests a suitable number of clusters.
Beyond this point, adding more clusters yields only small reductions in within-cluster variance.
Based on the elbow curve, k = 3 clusters is selected for this project.
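
As a complementary check on the elbow heuristic, the average silhouette width can be compared across candidate values of k. The sketch below uses `silhouette()` from the cluster package on simulated data; the data are a stand-in for `scaled_features`, and this check is an optional addition rather than part of the original analysis.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic stand-in for scaled_features: three separated groups, 3 dimensions
toy <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 5), ncol = 3),
  matrix(rnorm(150, mean = 10), ncol = 3)
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6 (higher is better)
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(which.max(avg_sil)))
round(avg_sil, 3)
```

Silhouette widths range from -1 to 1. If the k with the highest average agrees with the elbow, that supports the choice; if not, the segments should be inspected more closely.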

# Apply K-Means clustering
set.seed(123)
kmeans_final <- kmeans(scaled_features, centers = 3, nstart = 25)

# Add cluster labels to dataset
online_retail$Cluster <- as.factor(kmeans_final$cluster)

# View cluster sizes
table(online_retail$Cluster)
## 
##    1    2    3 
## 2758 1556   24

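Interpretation:
Customers have been grouped into three clusters of very unequal size: 2,758, 1,556, and 24 customers.
Each cluster represents a different purchasing behavior segment. The smallest cluster contains only 24 customers, which is consistent with the extreme TotalQuantity and Frequency values seen after scaling.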
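
The cluster numbers returned by `kmeans()` are arbitrary and can change with a different seed, so it is safer to derive segment names from the centroids than to hard-code them. The sketch below ranks clusters by their TotalSpend centroid on simulated data; the data and the Low/Medium/High naming rule are assumptions for illustration.

```r
set.seed(123)
# Synthetic stand-in for scaled_features: 20 customers in each of three groups
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 3),
  matrix(rnorm(60, mean = 5), ncol = 3),
  matrix(rnorm(60, mean = 10), ncol = 3)
)
colnames(toy) <- c("TotalSpend", "TotalQuantity", "Frequency")

fit <- kmeans(toy, centers = 3, nstart = 25)

# Rank clusters by their centroid on TotalSpend (1 = lowest spend)
spend_rank <- rank(fit$centers[, "TotalSpend"])

# Map each observation's cluster number to a label via that ranking
labels  <- c("Low", "Medium", "High")
segment <- factor(labels[spend_rank[fit$cluster]], levels = labels)

table(segment)
```

With this mapping, the segment names stay consistent even if a rerun assigns the cluster numbers in a different order.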

# Cluster visualization
ggplot(
  online_retail,
  aes(x = TotalSpend, y = TotalQuantity, color = Cluster)
) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Customer Segmentation using K-Means",
    x = "Total Spend",
    y = "Total Quantity"
  ) +
  theme_minimal()

Interpretation:
The scatter plot shows clear separation between customer segments.
High-spending and high-quantity customers form distinct clusters, confirming effective segmentation.

# Cluster profiling
aggregate(
  cbind(TotalSpend, TotalQuantity, Frequency) ~ Cluster,
  data = online_retail,
  mean
)
##   Cluster TotalSpend TotalQuantity Frequency
## 1       1   5.844509      268.7814  1.794416
## 2       2   7.851925     2044.4299  7.676735
## 3       3  11.100062    51890.8333 68.250000

Cluster Interpretation

Based on the K-Means clustering results, customers are grouped into three distinct segments according to their purchasing behavior.

10.3 Cluster 1 – Low-Value Customers

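Customers in this cluster have the lowest average total spend (5.84), the smallest average quantity (about 269 units), and the lowest purchase frequency (about 1.8 purchases on average). It is the largest segment, with 2,758 customers. These customers may be occasional buyers or new customers.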

10.4 Cluster 2 – Medium-Value Customers

This cluster represents regular customers with moderate spending (average TotalSpend of 7.85), about 2,044 units purchased, and about 7.7 purchases on average. With 1,556 customers, they form a stable customer base and have potential to be converted into high-value customers.

10.5 Cluster 3 – High-Value Customers

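Customers in this cluster show the highest average spend (11.10), purchase very large quantities (about 51,891 units on average), and shop frequently (about 68 purchases on average). Although the segment contains only 24 customers, these are loyal and premium customers who contribute significantly to overall revenue.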

11 Business Insights

The customer segments identified through clustering can be used to support business decision-making:

12 Conclusion

This project demonstrates an end-to-end machine learning workflow using R on the UCI Online Retail dataset. Customer segmentation was performed using K-Means clustering, an unsupervised learning algorithm.

The results provide meaningful insights into customer behavior and can help businesses improve marketing strategies, customer retention, and revenue growth.