Online Retail ML Project (R)

Dataset Used

The dataset used in this project is the UCI Online Retail Dataset, which contains transactional data of a UK-based online retail store.
Each record represents a product purchase, including invoice number, product description, quantity, invoice date, unit price, customer ID, and country.

Machine Learning Task

The objective of this project is Customer Segmentation, which is an unsupervised learning problem.
The goal is to group customers based on their purchasing behavior to identify meaningful segments.

Possible Algorithms

The following clustering algorithms were considered:

K-Means Clustering
Hierarchical Clustering
DBSCAN

Algorithm Chosen and Justification

K-Means Clustering was selected because:

- It is efficient for large datasets
- Easy to interpret cluster results
- Widely used in customer segmentation problems

Loading Required Libraries

The following libraries are required for data manipulation, visualization, and clustering.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

library(cluster)

## Warning: package 'cluster' was built under R version 4.4.3

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Interpretation:
These libraries are used for data manipulation (dplyr), visualization (ggplot2), clustering (cluster), and cluster visualization (factoextra).

Loading the Dataset

# Load the dataset
online_retail <- read.csv("data/processed/cleaned_online_retail.csv")

# View structure
str(online_retail)

## 'data.frame':    4338 obs. of  4 variables:
##  $ CustomerID   : int  12346 12347 12348 12349 12350 12352 12353 12354 12355 12356 ...
##  $ TotalSpend   : num  77184 4310 1797 1758 334 ...
##  $ TotalQuantity: int  74215 2458 2341 631 197 536 20 530 240 1591 ...
##  $ Frequency    : int  1 182 31 73 17 85 4 58 13 59 ...

# View first few rows
head(online_retail)

##   CustomerID TotalSpend TotalQuantity Frequency
## 1      12346   77183.60         74215         1
## 2      12347    4310.00          2458       182
## 3      12348    1797.24          2341        31
## 4      12349    1757.55           631        73
## 5      12350     334.40           197        17
## 6      12352    2506.04           536        85

What this code does:
This code loads the cleaned Online Retail dataset into R and displays its structure and sample records.

Interpretation:
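The cleaned file holds one row per customer (4,338 customers) rather than one row per purchase, with three aggregated features: TotalSpend, TotalQuantity, and Frequency.
The structure and first rows confirm that the data loaded with the expected numeric types and is ready for further analysis.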
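
The aggregation that produced the cleaned file is not shown in this report. The following is a minimal sketch of how per-customer features of this kind could be derived from raw transactions in base R. The raw column names (InvoiceNo, CustomerID, Quantity, UnitPrice) follow the fields listed under Dataset Used, the sample rows are invented, and treating Frequency as the number of transaction lines per customer is an assumption, not a confirmed definition.

```r
# Hypothetical raw transactions; the values are made up for illustration.
raw <- data.frame(
  InvoiceNo  = c("536365", "536365", "536366", "536367", "536368"),
  CustomerID = c(17850, 17850, 17850, 13047, 13047),
  Quantity   = c(6, 8, 2, 32, 3),
  UnitPrice  = c(2.55, 3.39, 1.85, 1.69, 4.25)
)

# Value of each transaction line = quantity x unit price.
raw$LineTotal <- raw$Quantity * raw$UnitPrice

# Sum spend and quantity per customer.
totals <- setNames(
  aggregate(cbind(LineTotal, Quantity) ~ CustomerID, data = raw, FUN = sum),
  c("CustomerID", "TotalSpend", "TotalQuantity")
)

# Count transaction lines per customer (assumed meaning of Frequency).
freq <- setNames(
  aggregate(LineTotal ~ CustomerID, data = raw, FUN = length),
  c("CustomerID", "Frequency")
)

# One row per customer, sorted by CustomerID.
customers <- merge(totals, freq, by = "CustomerID")
customers
```

If Frequency were instead meant to count distinct invoices, the second aggregation would need to count unique InvoiceNo values per customer.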

Exploratory Data Analysis (EDA)

Dataset Summary

What this step does:
This step checks each column for missing values as a basic data-quality summary before clustering.

colSums(is.na(online_retail))

##    CustomerID    TotalSpend TotalQuantity     Frequency 
##             0             0             0             0

Interpretation:
The output shows zero missing values in all four columns.
This matters because kmeans() stops with an error when its input contains missing values, so the data can be used without imputation or row removal.
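
As an illustration of what this check guards against, here is a small sketch on an invented data frame that contains one missing value. It counts the gaps, drops incomplete rows, and then prints a statistical summary of the numeric features. The data frame and its values are hypothetical and only mirror the column names of the cleaned dataset.

```r
# Illustrative data with the same columns as the cleaned dataset;
# the values (and the NA) are made up for this sketch.
customers <- data.frame(
  CustomerID    = c(1, 2, 3, 4),
  TotalSpend    = c(120.5, NA, 87.0, 430.2),
  TotalQuantity = c(10, 25, 7, 60),
  Frequency     = c(3, 5, 2, 12)
)

# Count missing values per column.
na_counts <- colSums(is.na(customers))
na_counts

# Keep only the rows that have no missing values.
complete <- customers[complete.cases(customers), ]

# Statistical summary of the numeric features (min, quartiles, mean, max).
summary(complete[, c("TotalSpend", "TotalQuantity", "Frequency")])
```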

Feature Distribution

What this step does:
This visualization helps understand how key variables are distributed across customers.

# Feature Distribution – Total Spend

# What this step does:
# Visualizes how customer spending is distributed across the dataset

hist(
  online_retail$TotalSpend,
  main = "Distribution of Total Spend",
  xlab = "Total Spend",
  col = "skyblue",
  border = "white"
)

Interpretation:
The histogram shows a right-skewed distribution of customer spending: most customers have low total spend, while a small number contribute much higher revenue.
This indicates the presence of high-value customers. Because K-Means relies on Euclidean distance, features measured in different units are standardized before clustering; note that standardizing puts the features on a common scale but does not remove the skew.
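
One common way to inspect a heavily skewed variable is to view it on a log scale. The sketch below uses simulated log-normal values, not the project data, and compares the raw and log1p-transformed distributions. Whether a log transform would improve the segmentation in this project is not tested here.

```r
set.seed(42)
# Simulated right-skewed spend values (log-normal); not the project data.
spend <- rlnorm(1000, meanlog = 6, sdlog = 1.5)

# In a right-skewed variable the mean sits well above the median.
c(mean = mean(spend), median = median(spend))

# log1p(x) = log(1 + x) compresses the long right tail and accepts zeros.
log_spend <- log1p(spend)

# Side-by-side histograms of the raw and transformed values.
op <- par(mfrow = c(1, 2))
hist(spend, main = "Raw", xlab = "Total Spend",
     col = "skyblue", border = "white")
hist(log_spend, main = "log1p", xlab = "log(1 + Total Spend)",
     col = "skyblue", border = "white")
par(op)
```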

Feature Engineering

What this step does:

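This step selects the three behavioral features already aggregated per customer (TotalSpend, TotalQuantity, and Frequency) and standardizes them with scale(). Features of this kind are commonly used for customer segmentation in retail analytics.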

# Select numeric features
features <- online_retail[, c("TotalSpend", "TotalQuantity", "Frequency")]

# Scale features
scaled_features <- scale(features)

# View summary
summary(scaled_features)

##    TotalSpend       TotalQuantity        Frequency       
##  Min.   :-0.22811   Min.   :-0.23588   Min.   :-0.39653  
##  1st Qu.:-0.19433   1st Qu.:-0.20437   1st Qu.:-0.32660  
##  Median :-0.15349   Median :-0.16097   Median :-0.22170  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.04367   3rd Qu.:-0.03935   3rd Qu.: 0.03619  
##  Max.   :30.94278   Max.   :38.78727   Max.   :33.89766

Interpretation:
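After scaling, each feature has a mean of zero and unit variance, so K-Means clustering is not biased toward variables with larger numeric values.
The maxima of roughly 31 to 39 standard deviations above the mean show that a few extreme customers remain, and these can strongly influence the cluster centers.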
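
To make the standardization concrete, the sketch below checks on a small invented matrix that scale() with its default arguments is the same as subtracting each column mean and dividing by the column standard deviation.

```r
# Two made-up features on very different scales.
x <- matrix(c(10, 20, 30, 40,
              1000, 1500, 500, 3000),
            ncol = 2,
            dimnames = list(NULL, c("Frequency", "TotalSpend")))

# Built-in standardization (center = TRUE, scale = TRUE by default).
scaled <- scale(x)

# Manual z-score: (value - column mean) / column standard deviation.
centered <- sweep(x, 2, colMeans(x), "-")
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")

# Both approaches agree up to floating-point tolerance.
all.equal(as.vector(scaled), as.vector(manual))

# Each scaled column has mean 0 and standard deviation 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```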

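# Elbow Method to determine optimal k
set.seed(123)        # make the random starts reproducible
wss <- numeric(10)   # total within-cluster sum of squares for k = 1..10

for (k in 1:10) {
  kmeans_model <- kmeans(scaled_features, centers = k, nstart = 25)
  wss[k] <- kmeans_model$tot.withinss
}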

## Warning: did not converge in 10 iterations

plot(
  1:10, wss,
  type = "b",
  pch = 19,
  frame = FALSE,
  xlab = "Number of Clusters (k)",
  ylab = "Total Within-Cluster Sum of Squares",
  main = "Elbow Method for Optimal k"
)

Interpretation:
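The elbow is the point on the curve after which adding another cluster no longer produces a large drop in the total within-cluster sum of squares.
Based on the elbow curve, k = 3 clusters is selected for this project.
The convergence warning above means that at least one run reached the default limit of 10 iterations before converging; passing a larger iter.max to kmeans() would address it.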
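
Reading an elbow by eye can be subjective, so a second criterion is sometimes used alongside it. The sketch below is not part of the original analysis. It computes the average silhouette width for several values of k on synthetic data, using silhouette() from the cluster package already loaded above, and sets iter.max = 100 to give each run room to converge.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Three synthetic, well-separated 2-D groups (not the project data).
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(x)  # pairwise Euclidean distances, reused for every k

# Average silhouette width for k = 2..6 (it is undefined for k = 1).
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25, iter.max = 100)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# Higher average silhouette width means tighter, better separated clusters.
best_k <- ks[which.max(avg_sil)]
data.frame(k = ks, avg_silhouette = round(avg_sil, 3))
```

On the project data, the same loop would run on scaled_features. Computing dist() for a few thousand customers is feasible, but memory use grows quadratically with the number of rows.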

# Apply K-Means clustering
set.seed(123)
kmeans_final <- kmeans(scaled_features, centers = 3, nstart = 25)

# Add cluster labels to dataset
online_retail$Cluster <- as.factor(kmeans_final$cluster)

# View cluster sizes
table(online_retail$Cluster)

## 
##    1    2    3 
## 3996  324   18

Interpretation:
K-Means assigned the 4,338 customers to three clusters of very unequal size: 3,996 customers in Cluster 1, 324 in Cluster 2, and 18 in Cluster 3.
Each cluster represents a different purchasing behavior segment.
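
The relative size of each segment can be expressed as a percentage of all customers. This short sketch retypes the sizes from the table above rather than reading them from the fitted model.

```r
# Cluster sizes as printed by table(online_retail$Cluster) above.
sizes <- c("1" = 3996, "2" = 324, "3" = 18)

# Share of customers in each cluster, in percent.
share_pct <- round(100 * prop.table(sizes), 1)
share_pct
```

With the data loaded, the equivalent call would be prop.table(table(online_retail$Cluster)).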

# Cluster visualization
ggplot(
  online_retail,
  aes(x = TotalSpend, y = TotalQuantity, color = Cluster)
) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Customer Segmentation using K-Means",
    x = "Total Spend",
    y = "Total Quantity"
  ) +
  theme_minimal()

Interpretation:
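In the scatter plot, high-spending and high-quantity customers fall into clusters that are separate from the bulk of customers.
The plot uses the unscaled values and shows only two of the three clustering features, so it illustrates the segmentation rather than validating it.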
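
A scatter plot of two raw features leaves out the third. One way to view all clustering features at once is to project them onto their first two principal components. The sketch below does this in base R on synthetic data. It is an illustration, not the method used in this report. fviz_cluster() from the factoextra package loaded above draws a comparable PCA-based view.

```r
set.seed(123)
# Synthetic stand-in for the three scaled features (not the project data).
feats <- scale(cbind(
  TotalSpend    = rlnorm(200, 6, 1),
  TotalQuantity = rlnorm(200, 5, 1),
  Frequency     = rlnorm(200, 3, 1)
))
km <- kmeans(feats, centers = 3, nstart = 25, iter.max = 100)

# Project the three features onto the first two principal components.
pca    <- prcomp(feats)
scores <- pca$x[, 1:2]

# Colour each customer by its assigned cluster.
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters in principal-component space")

# Share of the total variance captured by the two plotted components.
var_explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
var_explained
```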

# Cluster profiling
aggregate(
  cbind(TotalSpend, TotalQuantity, Frequency) ~ Cluster,
  data = online_retail,
  mean
)

##   Cluster TotalSpend TotalQuantity  Frequency
## 1       1   1017.950      603.1271   57.66792
## 2       2   8835.539     5076.6883  405.45988
## 3       3 110053.719    61826.0556 2004.11111

Cluster Interpretation

Based on the K-Means clustering results, customers are grouped into three distinct segments according to their purchasing behavior.

Cluster 1 – Low-Value Customers

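Customers in this cluster have the lowest averages on all three features: a mean total spend of about 1,018, about 603 units purchased, and a mean Frequency of about 58. With 3,996 customers, it is by far the largest segment. These customers may be occasional buyers or new customers.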

Cluster 2 – Medium-Value Customers

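This cluster of 324 customers represents regular customers with moderate spending and purchase frequency: a mean total spend of about 8,836, about 5,077 units, and a mean Frequency of about 405. These customers form a stable customer base and have potential to be converted into high-value customers.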

Cluster 3 – High-Value Customers

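Customers in this cluster show the highest averages by a wide margin: a mean total spend of about 110,054, about 61,826 units, and a mean Frequency of about 2,004. The cluster contains only 18 customers, who are likely premium or bulk buyers and contribute significantly to overall revenue.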
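
How much each segment contributes to revenue can be estimated from the figures already reported: cluster size multiplied by mean spend gives each cluster's total spend. The sketch below retypes those numbers by hand, so the results are approximate to the rounding of the printed means.

```r
# Cluster sizes and mean spend copied from the outputs above (as printed).
profile <- data.frame(
  Cluster   = factor(1:3),
  n         = c(3996, 324, 18),
  MeanSpend = c(1017.950, 8835.539, 110053.719)
)

# Total spend per cluster = number of customers x mean spend.
profile$TotalSpend <- profile$n * profile$MeanSpend

# Each cluster's share of overall spend, in percent.
profile$RevenuePct <- round(100 * profile$TotalSpend / sum(profile$TotalSpend), 1)
profile
```

On these figures, the 18 customers in Cluster 3 account for roughly 22% of total spend, while Cluster 1 accounts for roughly 46%.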

Business Insights

High-value customers should be retained using loyalty programs and personalized offers.
Medium-value customers can be targeted with promotions to increase their spending.
Low-value customers may be encouraged through discounts or onboarding campaigns.
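
To act on these recommendations, each customer needs a segment label. Cluster numbers from kmeans() are arbitrary and can change between runs, so the sketch below assigns labels by ranking clusters on mean spend instead of hard-coding cluster IDs. The data are synthetic, and the label names and the Segment column are illustrative choices.

```r
set.seed(123)
# Synthetic customers in three spend bands (not the project data).
customers <- data.frame(
  TotalSpend = c(rlnorm(60, 5, 0.3), rlnorm(30, 8, 0.3), rlnorm(10, 11, 0.3))
)

# Cluster on log spend so the toy groups stay well separated.
km <- kmeans(scale(log(customers$TotalSpend)), centers = 3,
             nstart = 25, iter.max = 100)
customers$Cluster <- km$cluster

# Rank clusters from lowest to highest mean spend.
mean_spend <- tapply(customers$TotalSpend, customers$Cluster, mean)
ranked_ids <- names(sort(mean_spend))

# Map each cluster ID to a label according to its rank.
segment_names <- c("Low-Value", "Medium-Value", "High-Value")
lookup <- setNames(segment_names, ranked_ids)
customers$Segment <- factor(lookup[as.character(customers$Cluster)],
                            levels = segment_names)

table(customers$Segment)
```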

Conclusion

This project demonstrates an end-to-end machine learning workflow using R on the UCI Online Retail dataset. Customer segmentation was performed using K-Means clustering, an unsupervised learning algorithm.

The results provide meaningful insights into customer behavior and can help businesses improve marketing strategies, customer retention, and revenue growth.