The objective is to gain customer insights of an e-commerce company that sells electronic products. BY clustering, the company may be able to see where problems arise with shipping.
The data comes from Kaggle titled: E-Commerce Shipping Data.
Data Source: https://www.kaggle.com/datasets/prachi13/customer-analytics
The data contains 10,999 observations and 12 variables of shipments to meet customer demand.
Data Dictionary
| Variable | Type | Description |
|---|---|---|
ID |
Key | Unique customer ID |
Warehouse_block |
character | Warehouse divided into sections |
mode_of_shipment |
character | Ways of shipment |
customer_care_calls |
numeric | Number of inquires made about shipment |
customer_rating |
numeric | Company’s rating of customer 1 worse, 5 highest |
cost_of_the_product |
numeric | Product cost in U.S.D |
prior_purchases |
numeric | Number of prior purchases |
product_importance |
character | Importance level of product |
discounted_offer |
numeric | Discount on specific product |
weight_in_grams |
numeric | Shipment weight in grams |
reached on time |
binary | On time shipment, No/Yes |
cluster_df <- df %>% #### get data ready for the clustering analysis
select(-ID, -Warehouse_block, -Mode_of_Shipment, -Product_importance, -Reached.on.Time_Y.N)
# View the first 3 rows of the cluster_df
head(cluster_df, n = 3)
## # A tibble: 3 x 6
## Customer_care_calls Customer_rating Cost_of_the_Product Prior_purchases
## <dbl> <dbl> <dbl> <dbl>
## 1 4 2 177 3
## 2 4 5 216 2
## 3 2 2 183 4
## # ... with 2 more variables: Discount_offered <dbl>, Weight_in_gms <dbl>
The data must be standardized to make variables comparable. Standardization consists of transforming the variables such that they have mean zero and standard deviation one.
cluster_df <- scale(cluster_df) #Scale Data
head(cluster_df, n = 3)
## Customer_care_calls Customer_rating Cost_of_the_Product Prior_purchases
## [1,] -0.04770915 -0.7007232 -0.6906903 -0.3727178
## [2,] -0.04770915 1.4215131 0.1207401 -1.0293770
## [3,] -1.79980563 -0.7007232 -0.5658549 0.2839414
## Discount_offered Weight_in_gms
## [1,] 1.889897 -1.4681730
## [2,] 2.815508 -0.3338781
## [3,] 2.136727 -0.1589950
I am using two methods for finding the optimal number of clusters
* Elbow
* Silhouhette
# Elbow method
fviz_nbclust(cluster_df, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow Method (k-means)")
# Silhouette method
fviz_nbclust(cluster_df, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette Method (k-means)")
The Elbow method suggests 3 clusters
The Silhouette method also suggest 3 clusters
The classification into groups requires computing the distance between each pair of observations. I used the Euclidean distance measurement.
The goal is to define clusters so that the total within-cluster sum of square (WSS), is minimized
# K-means clustering
k3 <- eclust(cluster_df, "kmeans", k = 3, nstart = 25, graph = FALSE)
# Visualize k-means clusters
fviz_cluster(k3, geom = "point", ellipse.type = "norm",
palette = "jco", ggtheme = theme_minimal())+
labs(subtitle = "k-3")
The silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
Another internal clustering validation measure which can be computed as follow:
\[ {Dunn}=\frac{min.separation}{max.diameter}\]
Cluster 1 has 6110 shipments
Cluster 2 has 2259 shipments
Clsuter 3 has 2630 shipments
\[ {WithinSS}=\frac{between SumSquares}{totalSumSquares}=36.6\%\]
fviz_silhouette(k3, palette = "jco",
ggtheme = theme_classic())+
labs(subtitle = "sil.width 0.280457 k=3")
## cluster size ave.sil.width
## 1 1 6110 0.31
## 2 2 2259 0.23
## 3 3 2630 0.26
# Silhouette information
silinfo <- k3$silinfo
names(silinfo)
## [1] "widths" "clus.avg.widths" "avg.width"
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10) ##### Show the first 10 rows only ###############
## cluster neighbor sil_width
## 3149 1 2 0.4950430
## 9117 1 2 0.4945994
## 7281 1 3 0.4915844
## 9817 1 2 0.4912080
## 6854 1 3 0.4909967
## 6720 1 3 0.4909031
## 9367 1 3 0.4907056
## 6151 1 3 0.4899460
## 10067 1 2 0.4895271
## 8961 1 2 0.4893568
# Average silhouette width of each cluster
silinfo$clus.avg.widths
## [1] 0.3099548 0.2261985 0.2585324
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
## [1] 0.280457
# The size of each clusters
k3$size
## [1] 6110 2259 2630
# Mean of centers
k3$centers
## Customer_care_calls Customer_rating Cost_of_the_Product Prior_purchases
## 1 -0.2901637 -0.00512063 -0.1854847 -0.3295137
## 2 -0.3467057 -0.02306065 -0.4462048 -0.3189409
## 3 0.9719044 0.03170382 0.8141781 1.0394739
## Discount_offered Weight_in_gms
## 1 -0.4614149 0.6991846
## 2 1.6803274 -0.8671867
## 3 -0.3713364 -0.8794841
# Silhouette width of observation
sil <- k3$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
## cluster neighbor sil_width
## 1096 2 1 -0.0005269036
## 479 2 3 -0.0005719781
## 2107 2 1 -0.0008795047
## 10797 2 1 -0.0011351006
## 2312 2 1 -0.0012314614
## 2068 2 3 -0.0012327547
## 973 2 1 -0.0013022129
## 2147 2 1 -0.0021617341
## 855 2 1 -0.0024570893
## 1720 2 1 -0.0027780989
## 326 2 1 -0.0029749893
## 1806 2 1 -0.0033112966
## 2631 2 1 -0.0033883830
## 2793 2 1 -0.0039388251
## 1581 2 1 -0.0042308013
## 17 2 1 -0.0042657699
## 2820 2 1 -0.0044883226
## 1290 2 1 -0.0045382113
## 2067 2 1 -0.0046196610
## 2836 2 1 -0.0050646137
## 814 2 1 -0.0054701567
## 1939 2 1 -0.0055492790
## 2187 2 1 -0.0056608534
## 1836 2 1 -0.0059714680
## 2900 2 1 -0.0069136894
## 1982 2 1 -0.0069551233
## 960 2 1 -0.0069694733
## 2800 2 1 -0.0070539126
## 193 2 1 -0.0071616628
## 580 2 1 -0.0072355736
## 5029 2 1 -0.0074578834
## 2956 2 1 -0.0075075628
## 6873 2 1 -0.0076513570
## 9634 2 1 -0.0077932370
## 129 2 1 -0.0078414068
## 1556 2 1 -0.0079162269
## 2124 2 1 -0.0081938390
## 3111 2 1 -0.0083545840
## 2370 2 1 -0.0086985335
## 1527 2 1 -0.0087975288
## 9435 2 1 -0.0088018507
## 787 2 1 -0.0090182702
## 238 2 1 -0.0090996012
## 2569 2 1 -0.0091637303
## 10418 2 1 -0.0091666325
## 1568 2 1 -0.0091733752
## 2196 2 1 -0.0096337081
## 2744 2 1 -0.0097368413
## 3133 2 1 -0.0099033854
## 370 2 1 -0.0109138422
## 1680 2 3 -0.0112531564
## 2848 2 1 -0.0118743262
## 99 2 1 -0.0124242226
## 1620 2 1 -0.0125314915
## 9 2 1 -0.0127605422
## 1022 2 1 -0.0128129862
## 661 2 1 -0.0128337787
## 2423 2 1 -0.0137349725
## 1674 2 1 -0.0137732600
## 635 2 1 -0.0137844845
## 430 2 1 -0.0139465851
## 941 2 1 -0.0145884317
## 930 2 1 -0.0149714531
## 3080 2 1 -0.0159077701
## 1458 2 1 -0.0160846247
## 2190 2 1 -0.0160990472
## 2403 2 1 -0.0162223582
## 2858 2 1 -0.0162361808
## 2417 2 1 -0.0164529129
## 3070 2 1 -0.0166753821
## 2764 2 1 -0.0173889665
## 2637 2 1 -0.0174855388
## 6015 2 1 -0.0180657734
## 847 2 1 -0.0182793571
## 921 2 1 -0.0184699709
## 502 2 1 -0.0186906239
## 93 2 1 -0.0188079291
## 2630 2 1 -0.0191393642
## 1642 2 1 -0.0192152027
## 2933 2 1 -0.0196003288
## 2495 2 1 -0.0201099888
## 1512 2 1 -0.0203238361
## 1392 2 1 -0.0203833186
## 2173 2 1 -0.0207615231
## 2155 2 1 -0.0209858840
## 1172 2 1 -0.0213854627
## 1238 2 1 -0.0214813444
## 9665 2 1 -0.0218133009
## 2443 2 1 -0.0221532193
## 25 2 1 -0.0230020943
## 10809 2 1 -0.0239815658
## 2743 2 1 -0.0240065037
## 2515 2 1 -0.0244767281
## 2479 2 1 -0.0247459901
## 10783 2 1 -0.0253935814
## 748 2 1 -0.0253942499
## 2846 2 1 -0.0255786743
## 438 2 1 -0.0261166793
## 2488 2 1 -0.0268266900
## 1511 2 1 -0.0284204983
## 860 2 1 -0.0287756320
## 2258 2 1 -0.0292228663
## 1518 2 1 -0.0317200434
## 2065 2 1 -0.0322611834
## 1002 2 1 -0.0325787929
## 3828 2 1 -0.0334662630
## 5520 2 1 -0.0337298421
## 2815 2 1 -0.0342673419
## 1959 2 1 -0.0347072750
## 1974 2 1 -0.0354738947
## 1036 2 1 -0.0367363420
## 6608 2 1 -0.0372143497
## 2728 2 1 -0.0373883755
## 1406 2 1 -0.0378728076
## 3090 2 1 -0.0387402278
## 582 2 1 -0.0389791999
## 799 2 1 -0.0404630270
## 1364 2 1 -0.0418164374
## 2794 2 1 -0.0420478611
## 2893 2 1 -0.0422478763
## 1241 2 1 -0.0423856511
## 712 2 1 -0.0433060516
## 2734 2 1 -0.0439719055
## 2462 2 1 -0.0457461281
## 614 2 1 -0.0460951563
## 1817 2 1 -0.0476833797
## 1935 2 1 -0.0480379266
## 3062 2 1 -0.0485713726
## 1408 2 1 -0.0486816787
## 2880 2 1 -0.0489163391
## 126 2 1 -0.0500652266
## 740 2 1 -0.0507149783
## 385 2 1 -0.0514337774
## 727 2 1 -0.0518145973
## 449 2 1 -0.0521921822
## 2174 2 1 -0.0526354227
## 1276 2 1 -0.0540903731
## 2359 2 1 -0.0593596541
## 1943 2 1 -0.0612332471
## 2313 2 1 -0.0615006387
## 10604 3 1 -0.0006942137
## 1417 3 1 -0.0009946183
## 1898 3 2 -0.0010815538
## 7954 3 1 -0.0012410138
## 6529 3 1 -0.0017249349
## 2902 3 1 -0.0019942618
## 6532 3 1 -0.0021891865
## 1380 3 1 -0.0022026573
## 3734 3 1 -0.0022326446
## 1658 3 2 -0.0024861684
## 8813 3 1 -0.0025145384
## 2549 3 1 -0.0025360470
## 4986 3 1 -0.0029482627
## 9121 3 1 -0.0032196408
## 6796 3 1 -0.0033592286
## 9727 3 1 -0.0033601384
## 7928 3 1 -0.0034443391
## 4052 3 1 -0.0037965742
## 1577 3 1 -0.0044440530
## 9799 3 1 -0.0052427785
## 6804 3 1 -0.0053976693
## 9127 3 1 -0.0054847337
## 4070 3 1 -0.0058174306
## 9715 3 1 -0.0059018811
## 2699 3 1 -0.0060580787
## 9234 3 1 -0.0061625210
## 7941 3 1 -0.0063737555
## 5199 3 1 -0.0068411079
## 10989 3 1 -0.0068938935
## 8065 3 1 -0.0069201572
## 4949 3 1 -0.0072259220
## 4535 3 1 -0.0072798171
## 5632 3 1 -0.0073350903
## 510 3 1 -0.0074914003
## 3560 3 1 -0.0077099106
## 7391 3 1 -0.0081422384
## 9763 3 1 -0.0082475336
## 2686 3 1 -0.0091630612
## 1776 3 1 -0.0096692775
## 6093 3 1 -0.0096919336
## 1370 3 1 -0.0097774285
## 10027 3 1 -0.0100013378
## 10002 3 1 -0.0100278847
## 4290 3 1 -0.0101038316
## 7443 3 1 -0.0108752768
## 6523 3 1 -0.0112455716
## 2329 3 1 -0.0114232118
## 8294 3 1 -0.0120609437
## 1215 3 1 -0.0120707317
## 781 3 1 -0.0120810468
## 3383 3 1 -0.0126660251
## 10554 3 1 -0.0129936385
## 10619 3 1 -0.0130025429
## 7564 3 1 -0.0135410213
## 5399 3 1 -0.0141935126
## 2722 3 1 -0.0149763051
## 9842 3 1 -0.0153023286
## 9328 3 1 -0.0159861524
## 2478 3 1 -0.0163124016
## 7719 3 1 -0.0164057108
## 5972 3 1 -0.0168496473
## 555 3 1 -0.0172244762
## 10699 3 1 -0.0173114533
## 3231 3 1 -0.0173455195
## 508 3 1 -0.0174889233
## 6964 3 1 -0.0176992760
## 9996 3 1 -0.0177226323
## 6067 3 1 -0.0177671015
## 887 3 1 -0.0184704061
## 7198 3 1 -0.0189258630
## 5623 3 1 -0.0204724945
## 10684 3 1 -0.0208751851
## 4983 3 1 -0.0209756924
## 3359 3 1 -0.0209893387
## 7125 3 1 -0.0213709010
## 7900 3 1 -0.0213780968
## 6765 3 1 -0.0214784118
## 8445 3 1 -0.0225360259
## 7300 3 1 -0.0233047806
## 72 3 1 -0.0234097741
## 3606 3 1 -0.0240939736
## 4527 3 1 -0.0241745736
## 8217 3 1 -0.0244194187
## 9620 3 1 -0.0246786602
## 1651 3 1 -0.0250882040
## 6487 3 1 -0.0253010184
## 10558 3 1 -0.0256688773
## 7789 3 1 -0.0258330302
## 9599 3 1 -0.0283922632
## 7974 3 1 -0.0289425029
## 6709 3 1 -0.0291033846
## 4882 3 1 -0.0291061000
## 10855 3 1 -0.0300823794
## 5795 3 1 -0.0303566875
## 1449 3 1 -0.0308323468
## 5449 3 1 -0.0311982359
## 375 3 1 -0.0312038656
## 10435 3 1 -0.0314755675
## 892 3 1 -0.0320920186
## 9619 3 1 -0.0321846683
## 388 3 1 -0.0322118531
## 1075 3 1 -0.0329209900
## 5402 3 1 -0.0330582699
## 7135 3 1 -0.0332851481
## 360 3 1 -0.0333873602
## 1931 3 1 -0.0336503071
## 1426 3 1 -0.0344522816
## 9646 3 1 -0.0351293674
## 794 3 1 -0.0354484610
## 2632 3 1 -0.0363982053
## 5686 3 1 -0.0376684702
## 8134 3 1 -0.0377870463
## 3338 3 1 -0.0388261170
## 4251 3 1 -0.0391779233
## 2172 3 1 -0.0403533142
## 6036 3 1 -0.0407007003
## 9325 3 1 -0.0415739430
## 5108 3 1 -0.0417472927
## 5385 3 1 -0.0419241739
## 7366 3 1 -0.0419762549
## 5110 3 1 -0.0426026291
## 2193 3 1 -0.0428453457
## 5337 3 1 -0.0428918426
## 9859 3 1 -0.0430869679
## 3050 3 1 -0.0435726941
## 9527 3 1 -0.0446878560
## 6118 3 1 -0.0454558905
## 8684 3 1 -0.0455120873
## 3309 3 1 -0.0458693399
## 1403 3 1 -0.0461221991
## 3878 3 1 -0.0462856553
## 6344 3 1 -0.0477136248
## 8409 3 1 -0.0488036209
## 9290 3 1 -0.0490358391
## 5937 3 1 -0.0500607041
## 4447 3 1 -0.0513398592
## 3391 3 1 -0.0519153485
## 5368 3 1 -0.0522132709
## 1371 3 1 -0.0556212586
## 8781 3 1 -0.0582108249
## 6695 3 1 -0.0601786352
## 6241 3 1 -0.0606194105
## 5587 3 1 -0.0607596215
## 6174 3 1 -0.0609305178
## 9991 3 1 -0.0617114630
## 8096 3 1 -0.0671009615
## 3039 3 1 -0.0710079135
## 7000 3 1 -0.0718982311
## 5751 3 1 -0.0743151009
## 9210 3 1 -0.0785673821
Dunn Index = (minimum distance between two clusters)/(maximum distance of points within the cluster)
The Dunn index should be as high as possible for the clusters to be stable.
# Statistics for k-means clustering
k3_stats <- cluster.stats(dist(cluster_df), k3$cluster)
# Dun index
k3_stats$dunn
## [1] 0.009534497
k3_stats
## $n
## [1] 10999
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 6110 2259 2630
##
## $min.cluster.size
## [1] 2259
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.528389 7.305424 7.807101
##
## $average.distance
## [1] 2.437688 2.818444 2.740155
##
## $median.distance
## [1] 2.403743 2.795968 2.590132
##
## $separation
## [1] 0.07443678 0.07443678 0.13498955
##
## $average.toother
## [1] 3.743315 3.875337 3.851397
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.00000000 0.07443678 0.1349896
## [2,] 0.07443678 0.00000000 0.3643669
## [3,] 0.13498955 0.36436693 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 3.751563 3.736231
## [2,] 3.751563 0.000000 4.162890
## [3,] 3.736231 4.162890 0.000000
##
## $average.between
## [1] 3.81292
##
## $average.within
## [1] 2.588212
##
## $n.between
## [1] 35812960
##
## $n.within
## [1] 24670541
##
## $max.diameter
## [1] 7.807101
##
## $min.separation
## [1] 0.07443678
##
## $within.cluster.ss
## [1] 41813.63
##
## $clus.avg.silwidths
## 1 2 3
## 0.3099548 0.2261985 0.2585324
##
## $avg.silwidth
## [1] 0.280457
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5786845
##
## $dunn
## [1] 0.009534497
##
## $dunn2
## [1] 1.325636
##
## $entropy
## [1] 0.9937922
##
## $wb.ratio
## [1] 0.6788005
##
## $ch
## [1] 3178.646
##
## $cwidegap
## [1] 1.748130 1.767741 1.897209
##
## $widestgap
## [1] 1.897209
##
## $sindex
## [1] 0.5388646
##
## $corrected.rand
## NULL
##
## $vi
## NULL
#add cluster assignment to original data
final_data <- cbind(df, cluster = k3$cluster)
#view final data
head(final_data, 20)
## ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating
## 1 1 D Flight 4 2
## 2 2 F Flight 4 5
## 3 3 A Flight 2 2
## 4 4 B Flight 3 3
## 5 5 C Flight 2 2
## 6 6 F Flight 3 1
## 7 7 D Flight 3 4
## 8 8 F Flight 4 1
## 9 9 A Flight 3 4
## 10 10 B Flight 3 2
## 11 11 C Flight 3 4
## 12 12 F Flight 4 5
## 13 13 D Flight 3 5
## 14 14 F Flight 4 4
## 15 15 A Flight 4 3
## 16 16 B Flight 4 3
## 17 17 C Flight 3 4
## 18 18 F Ship 5 5
## 19 19 D Ship 5 5
## 20 20 F Ship 4 5
## Cost_of_the_Product Prior_purchases Product_importance Discount_offered
## 1 177 3 low 44
## 2 216 2 low 59
## 3 183 4 low 48
## 4 176 4 medium 10
## 5 184 3 medium 46
## 6 162 3 medium 12
## 7 250 3 low 3
## 8 233 2 low 48
## 9 150 3 low 11
## 10 164 3 medium 29
## 11 189 2 medium 12
## 12 232 3 medium 32
## 13 198 3 medium 1
## 14 275 3 high 29
## 15 152 3 low 43
## 16 227 3 low 45
## 17 143 2 medium 6
## 18 227 3 medium 36
## 19 239 3 high 18
## 20 145 3 medium 45
## Weight_in_gms Reached.on.Time_Y.N cluster
## 1 1233 1 2
## 2 3088 1 2
## 3 3374 1 2
## 4 1177 1 2
## 5 2484 1 2
## 6 1417 1 2
## 7 2371 1 1
## 8 2804 1 2
## 9 1861 1 2
## 10 1187 1 2
## 11 2888 1 1
## 12 3253 1 2
## 13 3667 1 1
## 14 2602 1 2
## 15 1009 1 2
## 16 2707 1 2
## 17 1194 1 2
## 18 3952 1 2
## 19 2495 1 3
## 20 1059 1 2
The e-commerce data tracks shipments of electronic products. The company has a large warehouse that is divided up into 5 blocks, depending on the product segment. The data also tracks mode of shipment, and the number of inquires about a shipment, and if a shipment is late or not.
By clustering the data, the company can assess if further investigation into the cluster with the most problems. Maybe it has to do with the warehouse block, or late shipments may be out of their control by the mode of shipment or weather.
For this model, the k-means method did not perform well with this data set. Likely because K-means does not handle outliers well. My suggestion is to segregate the data and drill down on warehouse blocks or drill down on all the observations that did not arrive on time. Or pick an supervised learning method.