The dataset used in this project is the UCI Online Retail Dataset, which contains transactional data collected from a UK-based online retail store.
Each row in the dataset represents a single product purchase made by a customer. The dataset includes information such as:
- Invoice number
- Product description
- Quantity purchased
- Invoice date
- Unit price
- Customer ID
- Country
Since the dataset is transactional in nature, the same customer may appear multiple times across different invoices. Therefore, the data must be transformed into a customer-level format before applying machine learning techniques.
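The transformation to a customer-level table can be sketched as follows. This is a minimal illustration on made-up transactions, not the preprocessing actually used for this project. The column names (`InvoiceNo`, `CustomerID`, `Quantity`, `UnitPrice`) are assumed to follow the UCI file, and `LineTotal` and `customer_level` are hypothetical names.

```r
# Toy transaction table; values are invented for illustration only.
transactions <- data.frame(
  InvoiceNo  = c("A1", "A1", "A2", "B1", "B1"),
  CustomerID = c(1, 1, 1, 2, 2),
  Quantity   = c(2, 1, 5, 10, 3),
  UnitPrice  = c(3.0, 10.0, 1.0, 0.5, 2.0)
)

# Revenue of each transaction line
transactions$LineTotal <- transactions$Quantity * transactions$UnitPrice

# One row per customer: total spend, total quantity, number of distinct invoices.
# tapply() returns groups in sorted order, matching sort(unique(ids)).
ids <- transactions$CustomerID
customer_level <- data.frame(
  CustomerID    = sort(unique(ids)),
  TotalSpend    = as.numeric(tapply(transactions$LineTotal, ids, sum)),
  TotalQuantity = as.numeric(tapply(transactions$Quantity, ids, sum)),
  Frequency     = as.integer(tapply(transactions$InvoiceNo, ids,
                                    function(x) length(unique(x))))
)

print(customer_level)
```

Each customer now occupies exactly one row, which is the shape a clustering algorithm expects.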
The objective of this project is Customer Segmentation.
Customer segmentation is an unsupervised learning problem because the dataset does not contain predefined labels such as “high-value” or “low-value” customers. Instead, the goal is to allow the algorithm to automatically discover patterns in customer purchasing behavior.
Based on these patterns, customers are grouped into meaningful segments that can later be interpreted from a business perspective.
Several clustering algorithms were considered, and K-Means Clustering was selected as the primary algorithm for this project.
The reasons for choosing K-Means are:
- It is computationally efficient and works well with large datasets
- It summarizes each cluster by a single centroid, which makes the clusters easy to interpret
- It is widely used in real-world customer segmentation problems
Given the nature of the Online Retail dataset and the objective of grouping customers based on purchasing behavior, K-Means is a suitable and practical choice.
K-Means clustering is an unsupervised learning algorithm that partitions a dataset into K distinct clusters based on similarity between data points.
Each cluster is represented by a centroid, which is the mean position of all data points assigned to that cluster. The objective of K-Means is to group data points such that observations within the same cluster are as similar as possible, while observations in different clusters are as dissimilar as possible.
Mathematically, K-Means attempts to minimize the Within-Cluster Sum of Squares (WCSS), also known as cluster variance. This objective function is defined as:
\[ \underset{C_1, C_2, \dots, C_K}{\arg\min} \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2 \]
where:
- \(K\) represents the number of clusters
- \(C_k\) denotes the set of data points belonging to the k-th cluster
- \(x_i\) represents an individual data point
- \(\mu_k\) is the centroid (mean) of cluster \(C_k\)
- \(\| x_i - \mu_k \|^2\) represents the squared Euclidean distance between a data point and its cluster centroid
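To make the objective concrete, the following sketch computes WCSS by hand on simulated two-dimensional data and compares it with the `tot.withinss` value reported by `kmeans()`. The data and the names `x`, `assigned_centroids`, and `wcss_manual` are invented for illustration.

```r
set.seed(1)

# Two simulated groups of 20 points each (not the project data)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
fit <- kmeans(x, centers = 2, nstart = 10)

# Centroid of the cluster each point belongs to, one row per point
assigned_centroids <- fit$centers[fit$cluster, , drop = FALSE]

# WCSS: sum of squared Euclidean distances to the assigned centroid
wcss_manual <- sum((x - assigned_centroids)^2)

print(c(manual = wcss_manual, kmeans = fit$tot.withinss))
```

The two numbers should agree up to floating-point error, since `tot.withinss` is the value of this objective at the fitted solution.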
The K-Means algorithm follows an iterative optimization process consisting of the following steps:
1. Initialization: K centroids are initialized randomly from the dataset.
2. Assignment Step: Each data point is assigned to the nearest centroid based on Euclidean distance.
3. Update Step: New centroids are computed by taking the mean of all data points assigned to each cluster.
4. Convergence: Steps 2 and 3 are repeated until cluster assignments no longer change or the reduction in WCSS becomes negligible.
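The four steps can be sketched in a few lines of R. This is a simplified version of the classic Lloyd iteration written for illustration; `lloyd_kmeans` is a hypothetical helper, and it is not what `kmeans()` runs by default (the built-in function uses the Hartigan-Wong variant unless told otherwise).

```r
lloyd_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1, initialization: pick k distinct rows as starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2, assignment: squared Euclidean distance to every centroid
    d <- vapply(
      seq_len(k),
      function(j) rowSums(sweep(x, 2, centroids[j, ])^2),
      numeric(nrow(x))
    )
    new_assignment <- max.col(-d, ties.method = "first")

    # Step 4, convergence: stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 3, update: move each centroid to the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Usage on two simulated, well-separated groups
set.seed(10)
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
fit <- lloyd_kmeans(x, k = 2)
print(table(fit$cluster))
```

A single random start can end in a poor local minimum, which is why the project code calls `kmeans()` with `nstart = 25`.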
In this project, each customer is represented in a three-dimensional feature space using the following features:
- Total Spend
- Total Quantity
- Frequency of purchases
K-Means clustering groups customers by minimizing the squared distance between each customer and the centroid of its assigned cluster in this feature space, so customers with similar purchasing behavior end up in the same cluster. In this project, customers are segmented into three clusters, interpreted as low-value, medium-value, and high-value customers.
Minimizing WCSS makes customers within the same cluster as similar as possible on these three features. It does not by itself guarantee that the clusters are well separated, so the resulting segments are checked against their cluster profiles.
The following R libraries are used in this project:
Interpretation:
These libraries are used for data manipulation (dplyr),
visualization (ggplot2), clustering (cluster),
and cluster visualization (factoextra).
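The code that loads these libraries is not shown in this section. A minimal sketch of that step, assuming the four packages named above are the complete list, might look like this. The check for packages that are not installed is an addition for illustration.

```r
# Packages named in the text; assumed to be the complete list.
pkgs <- c("dplyr", "ggplot2", "cluster", "factoextra")

# Find out which packages are installed, without raising an error
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```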
# Load the dataset
online_retail <- read.csv("data/processed/cleaned_online_retail.csv")
# View structure
str(online_retail)
## 'data.frame': 4338 obs. of 10 variables:
## $ CustomerID : int 12346 12347 12348 12349 12350 12352 12353 12354 12355 12356 ...
## $ TotalSpend : num 11.25 8.37 7.49 7.47 5.82 ...
## $ TotalQuantity : int 74215 2458 2341 631 197 536 20 530 240 1591 ...
## $ Frequency : int 1 7 4 1 1 8 1 1 1 3 ...
## $ Recency : int 28090141 161881 6478621 1565941 26772541 3103981 17607781 20043541 18486061 1915801 ...
## $ M_Score : int 3 3 3 3 1 3 1 2 2 3 ...
## $ F_Score : int 1 3 2 1 1 3 1 1 1 2 ...
## $ R_Score : int 1 3 2 3 1 2 1 1 1 3 ...
## $ RFM_Score : int 5 9 7 7 3 8 3 4 4 8 ...
## $ Customer_Label: chr "Medium" "High" "High" "High" ...
# View first few rows
head(online_retail)
## CustomerID TotalSpend TotalQuantity Frequency Recency M_Score F_Score
## 1 12346 11.253955 74215 1 28090141 3 1
## 2 12347 8.368925 2458 7 161881 3 3
## 3 12348 7.494564 2341 4 6478621 3 2
## 4 12349 7.472245 631 1 1565941 3 1
## 5 12350 5.815324 197 1 26772541 1 1
## 6 12352 7.826858 536 8 3103981 3 3
## R_Score RFM_Score Customer_Label
## 1 1 5 Medium
## 2 3 9 High
## 3 2 7 High
## 4 3 7 High
## 5 1 3 Low
## 6 2 8 High
What this code does:
This code loads the cleaned Online Retail dataset into R and displays
its structure and first six records.
Interpretation:
The file is already aggregated to customer level: it holds 4,338
customers and 10 variables.
Of these, TotalSpend, TotalQuantity, and Frequency are the features
used for customer segmentation.
Basic inspection confirms that the data loaded correctly and is ready for
further analysis.
Before applying clustering, it is important to understand the dataset and ensure its quality.
One of the first checks performed is to identify missing values. Clustering algorithms such as K-Means cannot handle missing data, so this step ensures that the dataset is suitable for modeling.
What this step does:
This step counts the missing values in each column of the dataset.
colSums(is.na(online_retail))
## CustomerID TotalSpend TotalQuantity Frequency Recency
## 0 0 0 0 0
## M_Score F_Score R_Score RFM_Score Customer_Label
## 0 0 0 0 0
Interpretation:
Every column reports zero missing values.
No rows need to be removed or imputed, so the dataset can be passed to
K-Means as it is.
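As a small illustration of why this check matters, the sketch below uses an invented data frame (not the project data) to show that `kmeans()` stops with an error when values are missing, and how `complete.cases()` could be used to keep only complete rows. Dropping rows is just one option, and it is not needed here because no values are missing.

```r
# Invented example with two incomplete rows
toy <- data.frame(
  TotalSpend = c(5.8, NA, 7.4, 8.1, 6.2),
  Frequency  = c(1, 3, NA, 7, 2)
)

# Same check as in the report: missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# kmeans() raises an error on missing values; catch it instead of stopping
kmeans_result <- tryCatch(
  kmeans(toy, centers = 2),
  error = function(e) "kmeans failed"
)
print(kmeans_result)

# Keep only rows with no missing values
complete_toy <- toy[complete.cases(toy), , drop = FALSE]
print(nrow(complete_toy))
```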
The distribution of customer spending is visualized using a histogram.
This visualization helps identify spending patterns among customers. The distribution is right-skewed, meaning that:
- Most customers have low spending
- A small number of customers contribute very high revenue
This observation highlights the presence of high-value customers. Feature scaling is also required before clustering, because the features differ widely in magnitude: TotalQuantity reaches tens of thousands while Frequency is mostly in single digits.
# Feature Distribution – Total Spend
# What this step does:
# Visualizes how customer spending is distributed across the dataset
hist(
  online_retail$TotalSpend,
  main = "Distribution of Total Spend",
  xlab = "Total Spend",
  col = "skyblue",
  border = "white"
)
Interpretation:
The histogram shows a right-skewed distribution of customer spending,
where most customers have low total spend while a small number of
customers contribute significantly higher revenue.
This indicates the presence of high-value customers whose extreme values
can dominate distance calculations, which is one reason the features are
standardized before applying K-Means clustering.
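Skewness can also be checked numerically rather than read off a plot. The sketch below uses simulated log-normal values as a stand-in for spending (an assumption made for illustration, not a claim about the project data) and a hypothetical `skewness()` helper based on the standardized third moment.

```r
set.seed(7)

# Simulated right-skewed "spend" values; not the project data
spend <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Sample skewness: positive values indicate a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(spend)
skew_log <- skewness(log(spend))
print(c(raw = skew_raw, log = skew_log))

# In a right-skewed distribution the mean sits above the median
print(c(mean = mean(spend), median = median(spend)))
```

A log transform pulls in the right tail, so the skewness of `log(spend)` is close to zero for data of this kind.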
For customer segmentation, three key behavioral features are selected: Total Spend, Total Quantity, and Frequency of purchases.
These features collectively represent how valuable, active, and engaged a customer is.
Since K-Means is a distance-based algorithm, all features are scaled using standardization. This ensures that no single feature dominates the clustering process due to larger numeric values.
This step selects features that were derived from the raw transaction data during preprocessing. These features are commonly used for customer segmentation in retail analytics.
# Select numeric features
features <- online_retail[, c("TotalSpend", "TotalQuantity", "Frequency")]
# Scale features
scaled_features <- scale(features)
# View summary
summary(scaled_features)
## TotalSpend TotalQuantity Frequency
## Min. :-4.00411 Min. :-0.23588 Min. :-0.42505
## 1st Qu.:-0.68559 1st Qu.:-0.20437 1st Qu.:-0.42505
## Median :-0.06218 Median :-0.16097 Median :-0.29514
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.65411 3rd Qu.:-0.03935 3rd Qu.: 0.09457
## Max. : 4.73104 Max. :38.78727 Max. :26.59496
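Interpretation:
After scaling, every feature has mean zero and unit variance, so no
variable dominates the distance calculation purely because of its units.
The ranges are still not comparable: TotalQuantity and Frequency reach
maxima of about 38.8 and 26.6 standard deviations, which points to a few
extreme customers that can strongly influence the K-Means centroids.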
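The standardization performed by `scale()` can be reproduced by hand: subtract the column mean and divide by the column standard deviation. The sketch below checks this on a small invented table; the values and the names `toy_features`, `manual`, and `auto` are for illustration only.

```r
# Small invented feature table (not the project data)
toy_features <- data.frame(
  TotalSpend = c(5.8, 7.5, 8.4, 11.3),
  Frequency  = c(1, 4, 7, 2)
)

# z-score computed by hand, column by column
manual <- sapply(toy_features, function(col) (col - mean(col)) / sd(col))

# The same transformation using scale()
auto <- scale(toy_features)

# Compare the raw numbers, ignoring the attributes that scale() attaches
print(all.equal(as.vector(manual), as.vector(auto)))
```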
# Elbow Method to determine optimal k
set.seed(123)  # kmeans() uses random starts; fix the seed for reproducibility
wss <- numeric(10)
for (k in 1:10) {
  kmeans_model <- kmeans(scaled_features, centers = k, nstart = 25)
  wss[k] <- kmeans_model$tot.withinss
}
plot(
  1:10, wss,
  type = "b",
  pch = 19,
  frame = FALSE,
  xlab = "Number of Clusters (k)",
  ylab = "Total Within-Cluster Sum of Squares",
  main = "Elbow Method for Optimal k"
)
## Determining the Optimal Number of Clusters
The Elbow Method is used to determine the optimal number of clusters (k). It measures how the total within-cluster variance changes as the number of clusters increases.
The “elbow point” represents a balance between model simplicity and clustering performance.
Interpretation:
The elbow point in the graph indicates the optimal number of
clusters.
After this point, adding more clusters does not significantly reduce
within-cluster variance.
Based on the elbow curve, k = 3 clusters is selected
for this project.
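Reading an elbow off a plot is somewhat subjective, so a second criterion can be useful as a cross-check. The sketch below is not part of the original analysis: it computes the average silhouette width with `silhouette()` from the cluster package on simulated data with three groups. The data and the names `ks`, `avg_sil`, and `best_k` are invented for illustration.

```r
library(cluster)

set.seed(42)

# Three simulated, well-separated groups of 30 points (not the project data)
x <- rbind(
  cbind(rnorm(30, mean = 0), rnorm(30, mean = 0)),
  cbind(rnorm(30, mean = 8), rnorm(30, mean = 0)),
  cbind(rnorm(30, mean = 4), rnorm(30, mean = 7))
)
d <- dist(x)

# Average silhouette width for each candidate k (higher is better)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
print(best_k)
```

On the project data, the same loop could be run on `scaled_features` to see whether it supports the choice of k = 3.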
# Apply K-Means clustering
set.seed(123)
kmeans_final <- kmeans(scaled_features, centers = 3, nstart = 25)
# Add cluster labels to dataset
online_retail$Cluster <- as.factor(kmeans_final$cluster)
# View cluster sizes
table(online_retail$Cluster)
##
## 1 2 3
## 2758 1556 24
Interpretation:
Customers are grouped into three clusters of very unequal size: 2,758,
1,556, and 24 customers.
Each cluster represents a different purchasing behavior segment; the
smallest cluster is a small group of customers with extreme values.
# Cluster visualization
ggplot(
  online_retail,
  aes(x = TotalSpend, y = TotalQuantity, color = Cluster)
) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Customer Segmentation using K-Means",
    x = "Total Spend",
    y = "Total Quantity"
  ) +
  theme_minimal()
Interpretation:
The scatter plot shows clear separation between customer segments.
High-spending and high-quantity customers form distinct clusters,
confirming effective segmentation.
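The scatter plot uses two of the three clustering features on their original scale. One way to view all three at once, sketched here on simulated data with base R only, is to project the scaled features onto their first two principal components with `prcomp()`. This is an optional addition, not a step of the original analysis, and the names `toy`, `pca`, and `projected` are hypothetical.

```r
set.seed(3)

# Simulated stand-in for the three scaled features (not the project data)
toy <- scale(cbind(
  TotalSpend    = rnorm(60),
  TotalQuantity = rexp(60),
  Frequency     = rpois(60, lambda = 3)
))
fit <- kmeans(toy, centers = 3, nstart = 25)

# Project the three features onto the first two principal components
pca <- prcomp(toy)
projected <- pca$x[, 1:2]

# Colour each point by its cluster
plot(
  projected,
  col = fit$cluster,
  pch = 19,
  xlab = "PC1",
  ylab = "PC2",
  main = "Clusters in principal component space"
)
```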
# Cluster profiling
aggregate(
  cbind(TotalSpend, TotalQuantity, Frequency) ~ Cluster,
  data = online_retail,
  mean
)
## Cluster TotalSpend TotalQuantity Frequency
## 1 1 5.844509 268.7814 1.794416
## 2 2 7.851925 2044.4299 7.676735
## 3 3 11.100062 51890.8333 68.250000
Interpretation:
## Cluster Interpretation
Based on the K-Means clustering results, customers are grouped into three distinct segments according to their purchasing behavior.
Cluster 1 (low-value): Customers in this cluster have the lowest total spending, purchase the smallest quantities (about 269 items on average), and have the lowest purchase frequency (about 1.8 on average). These customers may be occasional buyers or new customers.
Cluster 2 (medium-value): This cluster represents regular customers with moderate spending and purchase frequency (about 7.7 on average). These customers form a stable customer base and have potential to be converted into high-value customers.
Cluster 3 (high-value): Customers in this cluster show the highest spending, purchase very large quantities (about 51,891 items on average), and shop frequently (about 68 purchases on average). With only 24 customers, this is a small group of loyal, premium customers who contribute significantly to overall revenue.
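K-Means numbers its clusters arbitrarily, so the cluster labelled 1 in one run may be the high-value segment in another. One way to attach stable names is to rank the clusters by mean spend. The sketch below uses a hypothetical `profile` table whose values are rounded from the cluster profile above but deliberately reordered; the `Segment` column and the label names are assumptions for illustration.

```r
# Hypothetical profile table with cluster numbers in arbitrary order
profile <- data.frame(
  Cluster    = factor(1:3),
  TotalSpend = c(7.85, 5.84, 11.10)
)

# Rank clusters by mean spend and map the rank to a readable label
segment_labels <- c("Low", "Medium", "High")
profile$Segment <- segment_labels[rank(profile$TotalSpend)]

print(profile)
```

The labels then follow the spending level rather than the cluster number, so they stay consistent across runs.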
The customer segments identified through clustering can be used to support business decision-making.
This project demonstrates an end-to-end machine learning workflow using R on the UCI Online Retail dataset. Customer segmentation was performed using K-Means clustering, an unsupervised learning algorithm.
The results provide meaningful insights into customer behavior and can help businesses improve marketing strategies, customer retention, and revenue growth.