The dataset used in this project is the UCI Online Retail
Dataset, which contains transactional data of a UK-based online
retail store.
Each record represents a product purchase, including invoice number,
product description, quantity, invoice date, unit price, customer ID,
and country.
The objective of this project is Customer
Segmentation, which is an unsupervised
learning problem.
The goal is to group customers based on their purchasing behavior to
identify meaningful segments.
The following clustering algorithms were considered:
K-Means Clustering was selected because:
- It is efficient for large datasets
- Its cluster results are easy to interpret
- It is widely used in customer segmentation problems
The following libraries are required for data manipulation, visualization, and clustering.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(cluster)
## Warning: package 'cluster' was built under R version 4.4.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Interpretation:
These libraries are used for data manipulation (dplyr),
visualization (ggplot2), clustering (cluster),
and cluster visualization (factoextra).
## Loading the Dataset
# Load the dataset
online_retail <- read.csv("data/processed/cleaned_online_retail.csv")
# View structure
str(online_retail)
## 'data.frame': 4338 obs. of 4 variables:
## $ CustomerID : int 12346 12347 12348 12349 12350 12352 12353 12354 12355 12356 ...
## $ TotalSpend : num 77184 4310 1797 1758 334 ...
## $ TotalQuantity: int 74215 2458 2341 631 197 536 20 530 240 1591 ...
## $ Frequency : int 1 182 31 73 17 85 4 58 13 59 ...
# View first few rows
head(online_retail)
## CustomerID TotalSpend TotalQuantity Frequency
## 1 12346 77183.60 74215 1
## 2 12347 4310.00 2458 182
## 3 12348 1797.24 2341 31
## 4 12349 1757.55 631 73
## 5 12350 334.40 197 17
## 6 12352 2506.04 536 85
What this code does:
This code loads the cleaned Online Retail dataset into R and displays
its structure and sample records.
Interpretation:
The dataset contains 4,338 customers, one row per customer, with three
aggregated features (TotalSpend, TotalQuantity, and Frequency) that will
be used for customer segmentation.
Basic inspection confirms that the data is properly loaded and ready for
further analysis.
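The processed file already holds one row per customer. As a rough sketch of how such columns could be derived from raw invoice lines, the example below aggregates a small made-up table. The column definitions used here (TotalSpend as the sum of Quantity times UnitPrice, TotalQuantity as the sum of Quantity, Frequency as the number of invoice lines) are assumptions; the source does not state how the cleaned file was built.

```r
# Toy raw transactions; all values are made up for illustration only
raw <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  Quantity   = c(2, 5, 1, 10, 3, 4),
  UnitPrice  = c(1.50, 2.00, 10.00, 0.50, 4.00, 2.50)
)

# Value of each invoice line
raw$LineTotal <- raw$Quantity * raw$UnitPrice

# Assumed feature definitions (not confirmed by the source).
# tapply() returns results in sorted CustomerID order.
customer_features <- data.frame(
  CustomerID    = sort(unique(raw$CustomerID)),
  TotalSpend    = as.numeric(tapply(raw$LineTotal, raw$CustomerID, sum)),
  TotalQuantity = as.numeric(tapply(raw$Quantity, raw$CustomerID, sum)),
  Frequency     = as.integer(tapply(raw$Quantity, raw$CustomerID, length))
)

print(customer_features)
```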
What this step does:
This step counts the missing values in each column of the dataset to
confirm that the data is complete before clustering.
colSums(is.na(online_retail))
## CustomerID TotalSpend TotalQuantity Frequency
## 0 0 0 0
Interpretation:
The output shows zero missing values in all four columns.
Since K-Means cannot handle missing data, this step
confirms that the dataset can be clustered without imputation.
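If the check had found gaps, one simple option would be to drop incomplete rows before clustering. The sketch below uses a small made-up data frame (not the retail data) to show the same `colSums(is.na(...))` check followed by `complete.cases()`; whether dropping or imputing is appropriate depends on the data.

```r
# Toy data with deliberate gaps (values are illustrative only)
toy <- data.frame(
  TotalSpend = c(100, NA, 250, 80),
  Frequency  = c(3, 5, NA, 2)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```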
What this step does:
This visualization helps understand how key variables are distributed
across customers.
# Feature Distribution – Total Spend
# What this step does:
# Visualizes how customer spending is distributed across the dataset
hist(
  online_retail$TotalSpend,
  main = "Distribution of Total Spend",
  xlab = "Total Spend",
  col = "skyblue",
  border = "white"
)
Interpretation:
The histogram shows a right-skewed distribution of customer spending,
where most customers have low total spend while a small number of
customers contribute significantly higher revenue.
This indicates the presence of high-value customers.
Because the features are measured in different units and on very
different scales, they are standardized before applying K-Means
clustering. Standardization changes the scale but does not remove the skew.
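A common way to reduce right skew before clustering is a log transform. This is not part of the workflow in this report; the sketch below only illustrates the idea on synthetic log-normal values, and the parameters chosen for `rlnorm()` are arbitrary assumptions.

```r
set.seed(42)

# Synthetic right-skewed "spend" values (log-normal), not the real data
spend <- rlnorm(1000, meanlog = 6, sdlog = 1.5)

# A mean far above the median is a simple sign of right skew
raw_ratio <- mean(spend) / median(spend)

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0
log_spend <- log1p(spend)
log_ratio <- mean(log_spend) / median(log_spend)

cat("mean/median before transform:", round(raw_ratio, 2), "\n")
cat("mean/median after transform: ", round(log_ratio, 2), "\n")
```

A ratio close to 1 after the transform suggests a more symmetric distribution, which tends to make distance-based methods less sensitive to a few extreme customers.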
This step selects the three numeric features derived from the raw transaction data (TotalSpend, TotalQuantity, and Frequency) and standardizes them. These features are commonly used for customer segmentation in retail analytics.
# Select numeric features
features <- online_retail[, c("TotalSpend", "TotalQuantity", "Frequency")]
# Scale features
scaled_features <- scale(features)
# View summary
summary(scaled_features)
## TotalSpend TotalQuantity Frequency
## Min. :-0.22811 Min. :-0.23588 Min. :-0.39653
## 1st Qu.:-0.19433 1st Qu.:-0.20437 1st Qu.:-0.32660
## Median :-0.15349 Median :-0.16097 Median :-0.22170
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.04367 3rd Qu.:-0.03935 3rd Qu.: 0.03619
## Max. :30.94278 Max. :38.78727 Max. :33.89766
Interpretation:
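After scaling, each feature has a mean of zero and unit standard
deviation, so all features are on a comparable scale.
This ensures that K-Means clustering is not biased toward variables with
larger numeric values.
The maximum values (about 31 to 39 standard deviations above the mean)
show that extreme outliers remain after scaling and can still pull the
cluster centers.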
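With its default arguments, `scale()` computes a z-score for each column: z = (x - mean) / sd, where sd is the sample standard deviation. The sketch below checks this against a manual calculation on a small made-up matrix.

```r
# Toy matrix with two features on different scales (illustrative values)
x <- matrix(
  c(10, 20, 30, 40,
    1, 2, 3, 10),
  ncol = 2,
  dimnames = list(NULL, c("TotalSpend", "Frequency"))
)

# Manual z-score: subtract the column mean, divide by the column sd
centered <- sweep(x, 2, colMeans(x), "-")
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")

# Built-in equivalent
auto <- scale(x)

# Compare values only; scale() also attaches centering/scaling attributes
print(all.equal(manual, auto, check.attributes = FALSE))
print(round(colMeans(auto), 10))
print(apply(auto, 2, sd))
```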
# Elbow Method to determine optimal k
wss <- numeric()
for (k in 1:10) {
  kmeans_model <- kmeans(scaled_features, centers = k, nstart = 25)
  wss[k] <- kmeans_model$tot.withinss
}
## Warning: did not converge in 10 iterations
plot(
  1:10, wss,
  type = "b",
  pch = 19,
  frame = FALSE,
  xlab = "Number of Clusters (k)",
  ylab = "Total Within-Cluster Sum of Squares",
  main = "Elbow Method for Optimal k"
)
Interpretation:
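The elbow point in the graph, where the curve bends and begins to
flatten, suggests a suitable number of clusters.
After this point, adding more clusters does not significantly reduce
within-cluster variance.
The warning above means that at least one run stopped at the default
limit of 10 iterations (iter.max) before converging, so the curve should
be read as approximate.
Based on the elbow curve, k = 3 clusters is selected
for this project.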
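A convergence warning from `kmeans()` can usually be avoided by raising `iter.max` above its default of 10, and calling `set.seed()` before the loop makes the curve reproducible. The sketch below shows both on synthetic data (three well-separated blobs, not the retail data); the value `iter.max = 100` is an arbitrary assumption.

```r
set.seed(123)  # fix the random starts so the curve is reproducible

# Synthetic data: three well-separated 2-D blobs (not the retail data)
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2),
  matrix(rnorm(100, mean = 20), ncol = 2)
)

wss <- numeric(10)
for (k in 1:10) {
  # iter.max raised from the default of 10 to give each run room to converge
  fit <- kmeans(x, centers = k, nstart = 25, iter.max = 100)
  wss[k] <- fit$tot.withinss
}

# Relative drop in WSS when moving from k to k + 1 clusters
rel_drop <- -diff(wss) / wss[-10]
print(round(rel_drop, 2))
```

Large relative drops followed by small ones mark the elbow; on this synthetic data the drop should become small after k = 3.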
# Apply K-Means clustering
set.seed(123)
kmeans_final <- kmeans(scaled_features, centers = 3, nstart = 25)
# Add cluster labels to dataset
online_retail$Cluster <- as.factor(kmeans_final$cluster)
# View cluster sizes
table(online_retail$Cluster)
##
## 1 2 3
## 3996 324 18
Interpretation:
Customers have been grouped into three clusters of very unequal size:
3,996, 324, and 18 customers.
Each cluster represents a different purchasing behavior segment, and the
smallest cluster holds only about 0.4% of all customers.
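The `cluster` package loaded earlier also provides `silhouette()`, which gives a complementary check on cluster quality. The sketch below computes the average silhouette width for a K-Means fit on synthetic data (not the retail data); values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlap.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Synthetic data: three well-separated 2-D blobs of 50 points each
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2),
  matrix(rnorm(100, mean = 20), ncol = 2)
)

fit <- kmeans(x, centers = 3, nstart = 25)

# silhouette() takes the cluster labels and a distance matrix
sil <- silhouette(fit$cluster, dist(x))
avg_width <- mean(sil[, "sil_width"])

print(table(fit$cluster))
print(round(avg_width, 2))
```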
# Cluster visualization
ggplot(
  online_retail,
  aes(x = TotalSpend, y = TotalQuantity, color = Cluster)
) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Customer Segmentation using K-Means",
    x = "Total Spend",
    y = "Total Quantity"
  ) +
  theme_minimal()
Interpretation:
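The scatter plot shows the separation between customer segments along
total spend and total quantity.
High-spending and high-quantity customers form distinct clusters,
which supports the segmentation, although the plot covers only two of
the three clustering features.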
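To view all clustering features in a single plot, the scaled data can be projected onto its first two principal components. The sketch below uses base R `prcomp()` on synthetic data with made-up values; it is an illustration of the technique, not a reproduction of the figure above.

```r
set.seed(123)

# Synthetic stand-in for three customer features (two groups of 50)
x <- cbind(
  TotalSpend    = c(rnorm(50, 100, 10), rnorm(50, 500, 50)),
  TotalQuantity = c(rnorm(50, 50, 5),   rnorm(50, 300, 30)),
  Frequency     = c(rnorm(50, 5, 1),    rnorm(50, 40, 5))
)

x_scaled <- scale(x)
fit <- kmeans(x_scaled, centers = 2, nstart = 25)

# Project the scaled features onto the principal components
pca <- prcomp(x_scaled)
scores <- pca$x[, 1:2]

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))

# Color points by cluster label
plot(scores, col = fit$cluster, pch = 19,
     main = "Clusters on the first two principal components")
```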
# Cluster profiling
aggregate(
  cbind(TotalSpend, TotalQuantity, Frequency) ~ Cluster,
  data = online_retail,
  mean
)
## Cluster TotalSpend TotalQuantity Frequency
## 1 1 1017.950 603.1271 57.66792
## 2 2 8835.539 5076.6883 405.45988
## 3 3 110053.719 61826.0556 2004.11111
## Cluster Interpretation
Based on the K-Means clustering results, customers are grouped into three distinct segments according to their purchasing behavior.
Cluster 1 (3,996 customers): Customers in this cluster have the lowest total spending (mean about 1,018), purchase fewer quantities, and have lower purchase frequency. These customers may be occasional buyers or new customers.
Cluster 2 (324 customers): This cluster represents regular customers with moderate spending (mean about 8,836) and purchase frequency. These customers form a stable customer base and have potential to be converted into high-value customers.
Cluster 3 (18 customers): Customers in this cluster show the highest spending (mean about 110,054), purchase large quantities, and shop frequently. These are loyal and premium customers who contribute significantly to overall revenue.
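Cluster means alone do not show how much each segment contributes to total revenue. One way to quantify this is to compare each cluster's share of customers with its share of total spend. The sketch below does this on a small made-up data frame; the values are illustrative and are not results from the retail data.

```r
# Toy clustered customers (illustrative values only)
customers <- data.frame(
  Cluster    = factor(c(1, 1, 1, 1, 2, 2, 3)),
  TotalSpend = c(100, 200, 150, 50, 1000, 1500, 7000)
)

# Share of customers in each cluster
size_share <- prop.table(table(customers$Cluster))

# Share of total spend contributed by each cluster
revenue_share <- tapply(customers$TotalSpend, customers$Cluster, sum) /
  sum(customers$TotalSpend)

print(round(cbind(size_share, revenue_share), 2))
```

A cluster with a small size share but a large revenue share is a candidate for retention-focused treatment.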
This project demonstrates an end-to-end machine learning workflow using R on the UCI Online Retail dataset. Customer segmentation was performed using K-Means clustering, an unsupervised learning algorithm.
The results provide meaningful insights into customer behavior and can help businesses improve marketing strategies, customer retention, and revenue growth.