The primary purpose of this project is to apply unsupervised learning techniques, specifically clustering, to identify hidden operational patterns within a large-scale supply chain dataset. The dataset contains approximately 1.8 million transactional records related to shipping performance, delivery delays, sales, and profitability across multiple regions.
Traditional descriptive analysis provides only global averages and fails to capture regional heterogeneity in supply chain performance. Since no predefined labels exist to categorize regions as efficient or inefficient, an unsupervised approach is required. Clustering enables the discovery of latent structures within the data by grouping regions with similar operational and business characteristics.
The project aims to segment supply chain regions into distinct clusters based on logistics efficiency and business performance, thereby providing actionable insights for operational optimization and strategic decision-making.
Firstly we need to upload the neccessary libraries we need
Then we need to load our dataset
#getwd()
#setwd("C:/Users/ulugb/Downloads")
#getwd()
data <- read.csv("DataCoSupplyChainDataset.csv")
# We need to inspect the structure
str(data)## 'data.frame': 180519 obs. of 53 variables:
## $ Type : chr "DEBIT" "TRANSFER" "CASH" "DEBIT" ...
## $ Days.for.shipping..real. : int 3 5 4 3 2 6 2 2 3 2 ...
## $ Days.for.shipment..scheduled.: int 4 4 4 4 4 4 1 1 2 1 ...
## $ Benefit.per.order : num 91.2 -249.1 -247.8 22.9 134.2 ...
## $ Sales.per.customer : num 315 311 310 305 298 ...
## $ Delivery.Status : chr "Advance shipping" "Late delivery" "Shipping on time" "Advance shipping" ...
## $ Late_delivery_risk : int 0 1 0 0 0 0 1 1 1 1 ...
## $ Category.Id : int 73 73 73 73 73 73 73 73 73 73 ...
## $ Category.Name : chr "Sporting Goods" "Sporting Goods" "Sporting Goods" "Sporting Goods" ...
## $ Customer.City : chr "Caguas" "Caguas" "San Jose" "Los Angeles" ...
## $ Customer.Country : chr "Puerto Rico" "Puerto Rico" "EE. UU." "EE. UU." ...
## $ Customer.Email : chr "XXXXXXXXX" "XXXXXXXXX" "XXXXXXXXX" "XXXXXXXXX" ...
## $ Customer.Fname : chr "Cally" "Irene" "Gillian" "Tana" ...
## $ Customer.Id : int 20755 19492 19491 19490 19489 19488 19487 19486 19485 19484 ...
## $ Customer.Lname : chr "Holloway" "Luna" "Maldonado" "Tate" ...
## $ Customer.Password : chr "XXXXXXXXX" "XXXXXXXXX" "XXXXXXXXX" "XXXXXXXXX" ...
## $ Customer.Segment : chr "Consumer" "Consumer" "Consumer" "Home Office" ...
## $ Customer.State : chr "PR" "PR" "CA" "CA" ...
## $ Customer.Street : chr "5365 Noble Nectar Island" "2679 Rustic Loop" "8510 Round Bear Gate" "3200 Amber Bend" ...
## $ Customer.Zipcode : int 725 725 95125 90027 725 14150 725 33162 725 94583 ...
## $ Department.Id : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Department.Name : chr "Fitness" "Fitness" "Fitness" "Fitness" ...
## $ Latitude : num 18.3 18.3 37.3 34.1 18.3 ...
## $ Longitude : num -66 -66 -122 -118 -66 ...
## $ Market : chr "Pacific Asia" "Pacific Asia" "Pacific Asia" "Pacific Asia" ...
## $ Order.City : chr "Bekasi" "Bikaner" "Bikaner" "Townsville" ...
## $ Order.Country : chr "Indonesia" "India" "India" "Australia" ...
## $ Order.Customer.Id : int 20755 19492 19491 19490 19489 19488 19487 19486 19485 19484 ...
## $ order.date..DateOrders. : chr "1/31/2018 22:56" "1/13/2018 12:27" "1/13/2018 12:06" "1/13/2018 11:45" ...
## $ Order.Id : int 77202 75939 75938 75937 75936 75935 75934 75933 75932 75931 ...
## $ Order.Item.Cardprod.Id : int 1360 1360 1360 1360 1360 1360 1360 1360 1360 1360 ...
## $ Order.Item.Discount : num 13.1 16.4 18 22.9 29.5 ...
## $ Order.Item.Discount.Rate : num 0.04 0.05 0.06 0.07 0.09 ...
## $ Order.Item.Id : int 180517 179254 179253 179252 179251 179250 179249 179248 179247 179246 ...
## $ Order.Item.Product.Price : num 328 328 328 328 328 ...
## $ Order.Item.Profit.Ratio : num 0.29 -0.8 -0.8 0.08 0.45 ...
## $ Order.Item.Quantity : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num 328 328 328 328 328 ...
## $ Order.Item.Total : num 315 311 310 305 298 ...
## $ Order.Profit.Per.Order : num 91.2 -249.1 -247.8 22.9 134.2 ...
## $ Order.Region : chr "Southeast Asia" "South Asia" "South Asia" "Oceania" ...
## $ Order.State : chr "Java Occidental" "Rajast\xe1n" "Rajast\xe1n" "Queensland" ...
## $ Order.Status : chr "COMPLETE" "PENDING" "CLOSED" "COMPLETE" ...
## $ Order.Zipcode : int NA NA NA NA NA NA NA NA NA NA ...
## $ Product.Card.Id : int 1360 1360 1360 1360 1360 1360 1360 1360 1360 1360 ...
## $ Product.Category.Id : int 73 73 73 73 73 73 73 73 73 73 ...
## $ Product.Description : logi NA NA NA NA NA NA ...
## $ Product.Image : chr "http://images.acmesports.sports/Smart+watch " "http://images.acmesports.sports/Smart+watch " "http://images.acmesports.sports/Smart+watch " "http://images.acmesports.sports/Smart+watch " ...
## $ Product.Name : chr "Smart watch " "Smart watch " "Smart watch " "Smart watch " ...
## $ Product.Price : num 328 328 328 328 328 ...
## $ Product.Status : int 0 0 0 0 0 0 0 0 0 0 ...
## $ shipping.date..DateOrders. : chr "2/3/2018 22:56" "1/18/2018 12:27" "1/17/2018 12:06" "1/16/2018 11:45" ...
## $ Shipping.Mode : chr "Standard Class" "Standard Class" "Standard Class" "Standard Class" ...
Count missing values per column
## Type Days.for.shipping..real.
## 0 0
## Days.for.shipment..scheduled. Benefit.per.order
## 0 0
## Sales.per.customer Delivery.Status
## 0 0
## Late_delivery_risk Category.Id
## 0 0
## Category.Name Customer.City
## 0 0
## Customer.Country Customer.Email
## 0 0
## Customer.Fname Customer.Id
## 0 0
## Customer.Lname Customer.Password
## 0 0
## Customer.Segment Customer.State
## 0 0
## Customer.Street Customer.Zipcode
## 0 3
## Department.Id Department.Name
## 0 0
## Latitude Longitude
## 0 0
## Market Order.City
## 0 0
## Order.Country Order.Customer.Id
## 0 0
## order.date..DateOrders. Order.Id
## 0 0
## Order.Item.Cardprod.Id Order.Item.Discount
## 0 0
## Order.Item.Discount.Rate Order.Item.Id
## 0 0
## Order.Item.Product.Price Order.Item.Profit.Ratio
## 0 0
## Order.Item.Quantity Sales
## 0 0
## Order.Item.Total Order.Profit.Per.Order
## 0 0
## Order.Region Order.State
## 0 0
## Order.Status Order.Zipcode
## 0 155679
## Product.Card.Id Product.Category.Id
## 0 0
## Product.Description Product.Image
## 180519 0
## Product.Name Product.Price
## 0 0
## Product.Status shipping.date..DateOrders.
## 0 0
## Shipping.Mode
## 0
#remove unnecessary data which is missing
data_cleaned <-data %>%
select(-Product.Description,-Order.Zipcode)Verify missing values after cleaning
## [1] 3
We do NOT cluster raw transactions
We aggregate by Order Region
# Aggregation: Combining operational and Marketing approach
sc_aggregated <- data_cleaned %>%
group_by(Order.Region) %>%
summarise(
# Operational
avg_real_days = mean(Days.for.shipping..real., na.rm = TRUE),
avg_scheduled_days = mean(Days.for.shipment..scheduled., na.rm = TRUE),
avg_late_risk_rate = mean(Late_delivery_risk, na.rm = TRUE),
# Business marketing related
avg_sales_per_customer = mean(Sales.per.customer, na.rm = TRUE),
avg_profit_per_order = mean(Order.Profit.Per.Order, na.rm = TRUE),
total_volume = n()
) %>%
# Use the column name in quotes
column_to_rownames("Order.Region")
# Check the result
head(sc_aggregated)## avg_real_days avg_scheduled_days avg_late_risk_rate
## Canada 3.331595 2.940563 0.4880083
## Caribbean 3.507213 2.960688 0.5307766
## Central Africa 3.560525 2.920692 0.5796064
## Central America 3.510462 2.948520 0.5475460
## Central Asia 3.417722 2.772152 0.5533454
## East Africa 3.514579 2.943844 0.5593952
## avg_sales_per_customer avg_profit_per_order total_volume
## Canada 175.4509 24.92253 959
## Caribbean 178.1280 20.65709 8318
## Central Africa 174.6646 19.94470 1677
## Central America 179.7343 21.74735 28341
## Central Asia 177.1314 23.59002 553
## East Africa 182.5347 23.30871 1852
This is the most critical step. “Sales” are in the hundreds, but
“Delay” is usually less than 1. Without scaling, the algorithm would
think Sales is 100x more important than Delay. Scaling makes them
equally important.
Final Impact: Every variable contributes fairly to the “distance” between clusters.
Required because variables have different scales
Hopkins statistic tells us whether clustering is meaningful or data is random
## [1] 0.6599166
Interpretation:
The Elbow method helps determine an appropriate number of clusters by examining how the within-cluster sum of squares (WSS) decreases as the number of clusters increases.
# Elbow method
fviz_nbclust(sc_scaled,kmeans, method = "wss")+labs(title = "Elbow Method", y="Total Within Cluster Sum of Squares")Measures how well-separated the clusters are:
Gap statistic compares observed clustering with a random reference distribution, gap_statistics that compares the total within-cluster variation for different values of k with their expected values under a null reference distribution.
set.seed(123)
gap_stat <- clusGap(sc_scaled, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)The Gap Statistic helps prove that even if the clusters aren’t “perfect,” they are still statistically real and not just coincidental patterns.
These tests provide statistical evidence for selecting K
library(fpc)
set.seed(123)
km2<-kmeans(sc_scaled,2, nstart = 25)
km3<-kmeans(sc_scaled,3, nstart = 25)
km4<-kmeans(sc_scaled,4, nstart = 25)
km5<-kmeans(sc_scaled,5, nstart = 25)
stats_k2 <- cluster.stats(dist(sc_scaled),km2$cluster)
stats_k3 <- cluster.stats(dist(sc_scaled),km3$cluster)
stats_k4 <- cluster.stats(dist(sc_scaled),km4$cluster)
stats_k5 <- cluster.stats(dist(sc_scaled),km5$cluster)
cat("CH index for k=2:", stats_k2$ch, "\n")
## CH index for k=2: 7.547178
cat("CH index for k=3:", stats_k3$ch, "\n")
## CH index for k=3: 8.72982
cat("CH index for k=4:", stats_k4$ch, "\n")
## CH index for k=4: 9.622308
cat("CH index for k=5:", stats_k5$ch, "\n")
## CH index for k=5: 9.967967
dh2<- dudahart2(sc_scaled, km2$cluster)
dh3<- dudahart2(sc_scaled, km3$cluster)
dh4<- dudahart2(sc_scaled, km4$cluster)
dh5<- dudahart2(sc_scaled, km5$cluster)
cat("Duda-Hart (p-value for K=2 split):", dh2$p.value, "\n")
## Duda-Hart (p-value for K=2 split): 0.07873072
cat("Duda-Hart (p-value for K=3 split):", dh3$p.value, "\n")
## Duda-Hart (p-value for K=3 split): 7.989265e-08
cat("Duda-Hart (p-value for K=4 split):", dh4$p.value, "\n")
## Duda-Hart (p-value for K=4 split): 1.130029e-11
cat("Duda-Hart (p-value for K=5 split):", dh5$p.value, "\n")
## Duda-Hart (p-value for K=5 split): 1.176448e-11Hopkins (0.66): As we suspected, our data is “Regular/Uniform.” This explains why our CH Index doesn’t have a massive peak—it’s climbing slowly because the data doesn’t have naturally separated “islands.”
CH Index: It is increasing as k increases (7.5 → 8.7 → 9.6). While it hasn’t peaked yet, the jump from k=2 to k=3 is a solid improvement.
Duda-Hart (The “Smoking Gun”):
Conclusion: We should proceed with k=3. It is the most defensible choice that balances statistical significance (Duda-Hart) with practical interpretability.
PAM is more robust than K-means for business data
#run final k-means
set.seed(123)
final_km <- kmeans(sc_scaled, centers = 3, nstart = 25)
final_pam <- pam(sc_scaled,k=3)
table(final_km$cluster, final_pam$clustering)##
## 1 2 3
## 1 0 8 8
## 2 1 0 2
## 3 3 1 0
fviz_cluster(final_pam,
data =sc_scaled,
palette="jco",
ellipse.type = "convex",
main="Final clustering PAM",
ggtheme = theme_minimal())We utilized PAM (Partitioning Around Medoids) for our final segmentation because it is significantly more robust to outliers than K-means.
To compare the two models, we used a Confusion Matrix (Cross-tabulation) to see how much the algorithms agree.
We used Principal Component Analysis (PCA) to squash our 5 different metrics (sales, profit, delays, etc.) into two dimensions so we can visualize them on a graph.
Hierarchical clustering confirms stability of clusters
# Calculating the distance Matrix
dist_matrix <- dist(sc_scaled, method="euclidean")
hc_ward <- hclust(dist_matrix, method="ward.D2")
fviz_dend(
hc_ward, k=3,
cex=0.7,
palette = "jco",
rect=TRUE,
rect_fill = TRUE,
main = "Dendogram: Hierarchical Grouping of Regions"
)
### Hierarchical Clustering Analysis
We implemented Hierarchical Clustering using Ward’s D2 method to complement our PAM analysis. The resulting Dendrogram acts as a ‘Family Tree’ for our supply chain regions. By cutting the tree at \(k=3\), we observed that the regional groupings remained largely consistent with our PAM model.
This high level of ‘inter-algorithm agreement’ provides strong confidence that our segmentation is based on real operational patterns rather than mathematical artifacts.
Compute business and operational statistics per cluster
cluster_profile <- sc_aggregated %>%
mutate(Region = rownames(.)) %>%
group_by(Cluster) %>%
summarise(
Count = n(),
Region_Names = paste(Region, collapse = ", "),
Avg_Delay = mean(avg_real_days - avg_scheduled_days),
Late_Risk_Pct = mean(avg_late_risk_rate) * 100,
Avg_Profit = mean(avg_profit_per_order),
Avg_Sales = mean(avg_sales_per_customer)
)
View(cluster_profile)
print(cluster_profile)## # A tibble: 3 × 7
## Cluster Count Region_Names Avg_Delay Late_Risk_Pct Avg_Profit Avg_Sales
## <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 "Canada, Northern … 0.513 53.3 24.1 191.
## 2 2 9 "Caribbean, Centra… 0.557 54.1 20.8 178.
## 3 3 10 "Central Africa, C… 0.582 55.7 22.4 178.
#How to visualize this project
library(tidyr)
plot_data <- cluster_profile %>%
select(Cluster, Avg_Delay,Avg_Profit) %>%
pivot_longer(cols=c(Avg_Delay,Avg_Profit), names_to = "Metric", values_to="Value")
ggplot(plot_data, aes(x = Cluster, y = Value, fill = Cluster)) +
geom_bar(stat = "identity") +
facet_wrap(~Metric, scales = "free") +
theme_minimal() +
labs(title = "Comparison of Operational Efficiency vs Profitability",
subtitle = "Cluster 1 shows clear dominance in Profit and Speed")
### Opperational Efficiency (Avg Delay) vs. Profitability (Avg Profit)
This chart is designed to show the “Trade-off” or the
“Sweet Spot” of our business operations.
Avg_Delay but the highest bar in
Avg_Profit. This proves that in our dataset, speed leads to
money. These regions are likely using optimized routes or superior
carriers.Business Insight: If management wants to increase profit, they should look at the logistics protocols used in Cluster 1 and apply them globally.
plot_data_v2 <- cluster_profile %>%
select(Cluster, Late_Risk_Pct, Avg_Sales) %>%
pivot_longer(cols = c(Late_Risk_Pct, Avg_Sales),
names_to = "Metric",
values_to = "Value")
# Create the plot
ggplot(plot_data_v2, aes(x = Cluster, y = Value, fill = Cluster)) +
geom_bar(stat = "identity") +
facet_wrap(~Metric, scales = "free") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") + # Different color set for variety
labs(title = "Late Delivery Risk vs. Sales Volume by Cluster",
subtitle = "Cluster 3 has the highest risk, while Cluster 1 leads in sales",
x = "Cluster Group",
y = "Calculated Mean Value")This chart acts as our “Risk-Reward” map. It helps management identify exactly where the company’s revenue is most “in danger” due to logistics failures.
Late_Risk_Pct (Risk). Over 55.7% of orders here are at risk
of being late.Business Insight: Cluster 3 is our “Emergency Zone.” Because the sales volume is high, even a small improvement in delivery risk could prevent thousands of customer complaints and refund requests.
This analysis successfully transformed a massive dataset of 180,000+ logistics entries into a targeted, three-tiered strategic roadmap for global supply chain management.
Ultimately, this project has moved us from anecdotal observations to a statistically validated segmentation. By proving our results with the Duda-Hart test and cross-validating with Hierarchical Clustering, we have built a defensible framework. We aren’t just looking at charts; we have established a clear, data-driven path forward for increasing global supply chain reliability and profitability.