1 Introduction

In recent years, online retail sales have grown strongly and steadily. According to a report by Interaktywnie.com, Polish online retail sales were expected to grow by up to 23% in 2018. It comes as no surprise that online customers are now crucial to many shops, even brick-and-mortar ones, which often sell more online than face to face. It is therefore very important to understand the distinct characteristics of online clients.

The fact that in online shopping each customer has a unique ID (a login or e-mail address) known to the seller creates great opportunities. Purchase history can be used to divide customers into groups based on their common characteristics, which is called Customer Segmentation. It allows the seller not only to get a better picture of customers and to understand them better, but also to reach out to them in a more effective and appropriate way. In other words, it allows each customer to be treated as an individual.

The online seller analysed in this paper is a small business that sells Polish-made children’s clothing and has 2 stationary outlets suited for wholesale. For the past 4 years the company has also been selling online on Allegro, Poland’s biggest e-commerce platform. The variety of products in the online shop is very limited compared to the stationary shops.

In this article, I use data mining techniques, namely K-means and PAM clustering, to segment the shop’s customers and to characterise the resulting groups.

2 Methodology

2.1 Methods for validating clustering tendency

Clustering tendency assessment checks whether the data is clusterable at all, and it is an important step in cluster analysis. The methods described in this section are the Hopkins statistic and the Visual Assessment of cluster Tendency (VAT).

2.1.1 Hopkins statistic

The Hopkins statistic tests whether a given dataset could have been generated by a uniform distribution. It is defined as follows:

\[H = \frac{\overline{x_{dist}}}{\overline{x_{dist}}+\overline{y_{dist}}}\] where \(\overline{x_{dist}}\) is the average nearest-neighbour distance among the real data points and \(\overline{y_{dist}}\) is the average nearest-neighbour distance from artificial points, sampled uniformly over the data space, to the real data points.

The null hypothesis of the test is that the dataset is uniformly distributed (hence no significant clusters). The alternative hypothesis states that the dataset is not uniformly distributed.

The rule of thumb is that if the Hopkins statistic is above 0.5, then the dataset is not clusterable.

If the value of H is close to zero, we can reject the null hypothesis and conclude that the dataset contains meaningful clusters.
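
To make the definition concrete, below is a minimal sketch of the statistic in R. It only illustrates the formula above and is not the implementation used later (Section 4 uses clustertend::hopkins()); here X stands for any numeric data matrix and n for the number of sampled points.

hopkins_sketch <- function(X, n = 50) {
  X <- as.matrix(X)
  # x_dist: nearest-neighbour distances among n sampled real points
  idx <- sample(nrow(X), n)
  D <- as.matrix(dist(X))
  diag(D) <- Inf                                  # exclude self-distances
  x_dist <- apply(D[idx, , drop = FALSE], 1, min)
  # y_dist: nearest-neighbour distances from n uniformly simulated points
  # (drawn within the range of each variable) to the real data points
  U <- apply(X, 2, function(col) runif(n, min(col), max(col)))
  y_dist <- apply(U, 1, function(u) min(sqrt(colSums((t(X) - u)^2))))
  mean(x_dist) / (mean(x_dist) + mean(y_dist))    # close to 0 => clusterable
}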

2.1.2 VAT: Visual Assessment of cluster Tendency

VAT is a visual approach to assessing cluster tendency. It produces a square digital image that shows pairwise dissimilarity information about the data points, with the objects reordered in a way that highlights potential clusters.

The VAT algorithm consists of 3 steps:

  1. Compute the dissimilarity matrix (DM) between the objects in the dataset, using Euclidean distance or another metric,

  2. Reorder the DM so that similar objects are close to one another, which results in an ordered dissimilarity matrix (ODM),

  3. Display the ODM as an ordered dissimilarity image (ODI).

The ODI, which is the visual output of VAT, visualises the cluster tendency of the dataset. If there are visible blocks along the diagonal of a VAT image, we can conclude that the data is clusterable.
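
As a minimal sketch, the factoextra package can draw such an ordered dissimilarity image directly; ToCluster stands for the prepared customer dataset used from Section 4 onwards.

library(factoextra)
# compute the DM (step 1), reorder it (step 2) and display the ODI (step 3)
dd <- get_dist(ToCluster, method = "euclidean")
fviz_dist(dd, order = TRUE, show_labels = FALSE)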

2.2 Clustering

K-means clustering and the PAM (Partitioning Around Medoids) algorithm are types of unsupervised learning. This means that the dataset has not been labelled, classified or categorised. The aim of these algorithms is to divide the data into K groups (clusters).

In K-means and PAM clustering, each object belongs to one and only one cluster (which is known in the literature as hard clustering). In a soft clustering method, by contrast, a single object can belong to many clusters, often with an associated degree of confidence.

2.2.1 K-means

As said before, K-means aims to divide the data into K groups (clusters). This involves finding K centroids: artificial or existing points that represent the centres of the clusters. In K-means clustering the aim is to identify the K centroids and allocate every object to the nearest one.

The algorithm for K-means is as follows:

  1. Place K points into the object data space. These points represent the initial group of centroids,

  2. Assign each data point to the closest centroid, based on Euclidean distance or another metric,

  3. Recalculate the positions of the K centroids. This is done by taking the mean of all data points assigned to that centroid’s cluster,

  4. Repeat steps 2 and 3 until the positions of the centroids no longer change and the sum of distances of individual points from their centroids is as small as possible.
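
A minimal sketch of this procedure in R, using the built-in kmeans() function; ToCluster again stands for the prepared dataset from Section 4, and k = 3 is just a placeholder here. The nstart argument reruns step 1 from several random centroid placements and keeps the best solution.

set.seed(123)                    # reproducible initial centroids (step 1)
km_fit <- kmeans(ToCluster, centers = 3, nstart = 25, iter.max = 100)
km_fit$centers                   # final centroid positions (step 3)
km_fit$tot.withinss              # within-cluster criterion minimised in step 4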

2.2.2 PAM

The main difference between the K-means and PAM methods is that K-means uses centroids (usually artificial points), while PAM uses medoids, which are always actual points in the dataset.

The algorithm for PAM is similar to that of K-means:

  1. Select K random objects as an initial set of medoids,

  2. Assign each data point to the closest medoid, based on Euclidean distance or another metric,

  3. Try to improve the quality of clustering by exchanging selected objects with unselected objects,

  4. Repeat steps 2 and 3 until the average distance of objects from their medoids is minimised.

There is an extension of the PAM method, CLARA (Clustering Large Applications). CLARA applies the PAM algorithm not to the entire dataset but to a small sample of it, and repeats the sampling-and-clustering procedure a pre-specified number of times so that sampling bias is minimised. This reduces the computing time and memory problems that might arise when analysing a large dataset. The method is intended for data containing a large number of objects and thus will not be used in this article.
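
A minimal sketch of both methods using the cluster package, under the same assumptions as the K-means sketch above (ToCluster as input, k = 3 as a placeholder):

library(cluster)
set.seed(123)
pam_fit <- pam(ToCluster, k = 3, metric = "euclidean")
pam_fit$medoids    # medoids are actual rows of the data, unlike K-means centroids

# CLARA applies PAM to repeated samples; shown for reference only,
# as it targets much larger datasets than the one analysed here
clara_fit <- clara(ToCluster, k = 3, samples = 5, metric = "euclidean")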

2.2.3 Choosing optimal K number

The algorithms described above divide the dataset into a chosen number K of clusters. There are many statistics useful for choosing the optimal K, but in general there is no method that determines its exact value.

2.2.3.1 Silhouette

Silhouette provides a visualisation of how well each object lies within its cluster.

The silhouette of object \(i\) is defined as follows:

\[s(i) = \frac{b(i)-a(i)}{\max\{a(i), b(i)\}}\] where \(a(i)\) is the average distance from object \(i\) to all other data points in its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster.

The silhouette measure has a range of [-1, 1]. In general, positive values are preferable, as they indicate that the object is far away from the neighbouring clusters; values close to 1 mean the object is well matched to its own cluster.
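
In R, silhouettes are provided by the cluster package. A minimal sketch, assuming a fitted clustering such as the km_fit object from the sketch in Section 2.2.1:

library(cluster)
sil <- silhouette(km_fit$cluster, dist(ToCluster))
summary(sil)$avg.width   # overall average silhouette width
# plot(sil) draws the silhouette plot (Section 5 uses fviz_silhouette())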

2.2.3.2 Average Within-Cluster Distance to Centroid

Another metric that is often used to compare different numbers of clusters K is the average distance between data points and their cluster centroid. Increasing the number of clusters will always decrease this statistic, so the goal is to find the point where the rate of decrease shifts sharply (the “elbow point”).
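
A minimal elbow sketch: kmeans() reports the total within-cluster sum of squares, a closely related criterion, which can be plotted against K and inspected for the elbow point.

set.seed(123)
wss <- sapply(1:10, function(k)
  kmeans(ToCluster, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")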

2.2.3.3 Shadow statistics

The shadow of an object is very similar to its silhouette. Shadow value is defined as

\[s(i) = \frac{2a(i)}{a(i) + b(i)}\]

where \(a(i)\) is the distance from data point \(i\) to its closest centroid and \(b(i)\) is the distance from \(i\) to the second-closest centroid.

The desired shadow value is close to 0, as this means that the data point is close to its cluster centroid. A value close to 1 means that the data point is almost equidistant from its two closest centroids.
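
To illustrate the definition, here is a hand-rolled sketch of the shadow values; Section 6 uses the shadow() function of the flexclust package instead.

shadow_sketch <- function(X, centers) {
  k <- nrow(centers)
  # point-to-centroid distances: rows = data points, columns = centroids
  D <- as.matrix(dist(rbind(centers, as.matrix(X))))[-(1:k), 1:k]
  a <- apply(D, 1, function(r) sort(r)[1])   # distance to closest centroid
  b <- apply(D, 1, function(r) sort(r)[2])   # distance to second-closest one
  2 * a / (a + b)                            # close to 0 => near own centroid
}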

2.2.3.4 Other statistics

There are many more metrics for determining the optimal number of clusters. In R, the NbClust package provides 30 indices for determining the optimal number of clusters; 26 of them will be calculated in this article. Two of them, the Hubert index and the D index, are graphical methods: the optimal number of clusters lies at their “elbow point” (where the rate of decrease of the statistic shifts sharply).

3 Data description

The data was collected from 18.06.2017 to 11.12.2018 (1.5 years). All the clients analysed are individual customers from Poland.

Allegro’s sales management system allows the seller to download transaction details: date of purchase, customer’s address, name and e-mail address, delivery information, and each product’s name, price and quantity sold. Each client has therefore been identified by their own unique e-mail address. In order to use information about customers’ city sizes, the dataset has been merged with the TERYT register of the Polish Central Statistical Office.

After the tedious work of pre-processing the data, eleven variables were chosen. The variables used in the analysis are presented in the table below.

VariableName Description
CustomerEmail E-mail address, unique for each client
Recency Time in months since the last purchase
FirstPurchase Time in months since the first purchase
SumTotal Total value of all purchases per client
MeanTotal Average spending per client
MinTotal Minimum spending per client
MaxTotal Maximum spending per client
NumberOfPurchases Number of purchases throughout the analysed period
AvgDeliveryCost Average delivery cost
CustomerCitySize Customer city size, coded as: 1. village or city with less than 10,000 inhabitants; 2. city with population between 10,000 and 50,000; 3. city with population between 50,000 and 100,000; 4. city with population between 100,000 and 200,000; 5. city with population between 200,000 and 500,000; 6. city with population over 500,000
SalePercentage Percentage of on-sale items among all purchases

The table below presents summary for the analysed variables.

kable(summarize(Customers2[, 2:11], type = 'numeric'), format = 'html') %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
N Mean SD Min Q1 Median Q3 Max
Recency 1188 9.93 4.82 0.0 7.00 11.00 13.00 18.00
FirstPurchase 1188 10.29 4.75 0.0 8.00 12.00 14.00 18.00
SumTotal 1188 74.35 61.18 16.5 43.98 53.99 85.68 760.20
MeanTotal 1188 65.97 41.13 16.5 43.59 52.19 78.32 467.89
MinTotal 1188 64.39 40.43 16.5 43.19 50.49 74.58 467.89
MaxTotal 1188 67.68 43.88 16.5 43.59 52.49 80.98 467.89
NumberOfPurchases 1188 1.11 0.39 1.0 1.00 1.00 1.00 6.00
AvgDeliveryCost 1188 10.57 4.12 0.0 8.00 8.60 13.41 49.20
CustomerCitySize 1188 2.69 1.79 1.0 1.00 2.00 4.00 6.00
SalePercentage 1188 0.10 0.27 0.0 0.00 0.00 0.00 1.00

4 Selection of clustering method and number of clusters

4.1 Prediagnostics

Firstly, the clustering tendency of the data will be analysed using the Hopkins statistic.

h <- clustertend::hopkins(data = ToCluster, n = 50)
h
## $H
## [1] 0.02408151

The Hopkins statistic for the data equals 0.0240815, which allows us to reject the null hypothesis that the data is uniformly distributed. From that, we can conclude that the data has a strong cluster structure.

I will also plot the Visual Assessment of cluster Tendency (VAT) to visually determine whether there are any potential clusters.

factoextra::get_clust_tendency(data = ToCluster, n = 5)
## $hopkins_stat
## [1] 0.0104772
## 
## $plot

The VAT image clearly indicates that the data is clusterable, as there are visible blocks along the diagonal. Again, the Hopkins statistic is close to 0, so we can conclude that the dataset is significantly clusterable.

4.2 Choosing optimal number of clusters

26 indices for determining the number of clusters will be analysed, using the Ward method and Euclidean distance. All statistics will be calculated with the NbClust package.
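
As a sketch, the index table reported further below (nb_best) could be assembled from a single NbClust call with index = "all"; the Article column of that table has presumably been added by hand.

library(NbClust)
nb <- NbClust(ToCluster, distance = "euclidean", min.nc = 2, max.nc = 10,
              method = "ward.D2", index = "all")
nb_best <- t(nb$Best.nc)   # per index: proposed number of clusters and index value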

Firstly, the graphical methods will be used: the Hubert index (Hubert and Arabie 1985) and the D index (Lebart et al. 2000).

NbClust(ToCluster, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="dindex")

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
## 
## $All.index
##       2       3       4       5       6       7       8       9      10 
## 40.3854 34.0693 30.3791 29.4378 28.7658 24.8283 23.5907 21.6145 19.4293

In the plot of the D index, a significant knee appears at 4 clusters. Therefore the optimal number of clusters according to the D index is 4.

NbClust(ToCluster, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="hubert")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 
## $All.index
##  2  3  4  5  6  7  8  9 10 
##  0  0  0  0  0  0  0  0  0

In the plot of the Hubert index, a significant knee appears at 3 clusters, so one can assume the optimal number of clusters is 3.

The results for all 26 indices are summarised below:

kable(nb_best, format='html') %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Index Article Number_clusters Value_Index
KL Krzanowski and Lai 1988 3 4.847400e+00
CH Calinski and Harabasz 1974 10 1.597292e+03
Hartigan Hartigan 1975 3 6.232847e+02
CCC Sarle 1983 2 1.989570e+01
Scott Scott and Symons 1971 9 1.498768e+03
Marriot Marriot 1971 5 2.989247e+39
TrCovW Milligan and Cooper 1985 3 9.657114e+11
TraceW Milligan and Cooper 1985 3 1.758072e+06
Friedman Friedman and Rubin 1967 9 1.928971e+03
Rubin Friedman and Rubin 1967 9 -2.577300e+00
Cindex Hubert and Levin 1976 5 5.000000e-02
DB Davies and Bouldin 1979 6 7.743000e-01
Silhouette Rousseeuw 1987 3 6.203000e-01
Duda Duda and Hart 1973 NA NA
PseudoT2 Duda and Hart 1973 NA NA
Beale Beale 1969 NA NA
Ratkowsky Ratkowsky and Lance 1978 2 2.452000e-01
Ball Ball and Hall 1965 3 1.671217e+06
PtBiserial Milligan 1980, 1981 6 6.266000e-01
Frey Frey and Van Groenewoud 1972 1 NA
McClain McClain and Rao 1975 3 1.534000e-01
Dunn Dunn 1974 6 1.140000e-02
Hubert Hubert and Arabie 1985 3 NaN
SDindex Halkidi et al. 2000 6 8.320000e-02
Dindex Lebart et al. 2000 4 NaN
SDbw Halkidi and Vazirgiannis 2001 10 4.402000e-01

To sum up the results, a bar chart will be plotted.

barplot(table(nb_best[,"Number_clusters"]),
        col=c("grey"), 
        main="Optimal number of clusters", 
        xlab="Number of clusters",
        ylab="Frequency among all indices")

The bar plot shows that the number of clusters indicated most frequently by the indices is 3. According to the majority rule, the best number of clusters is 3.

4.2.1 K-means

The optimal number of clusters for the K-means method will be visualised, using the average silhouette and within-cluster sums of squares statistics.

fviz_nbclust(ToCluster, FUNcluster = kmeans, method = "silhouette") + theme_classic()

fviz_nbclust(ToCluster, kmeans, method = "wss") + theme_classic()

From the graphs above one can conclude that the optimal number of clusters for the K-means method is 2. However, the number could also be 3 or 4, as the differences do not seem significant.

Taking the previous analyses into consideration, the chosen number of clusters for the K-means method is 3.

4.2.2 PAM

An analogous analysis will be conducted for the PAM method. The average silhouette and within-cluster sums of squares statistics will be visualised.

fviz_nbclust(ToCluster, pam, method = "silhouette") + theme_classic()

fviz_nbclust(ToCluster, pam, method = "wss") + theme_classic()

The analysis of the average silhouette and within-cluster sums of squares statistics leads to the conclusion that the optimal number of clusters for PAM is 2 or 3.

As with K-means, the chosen number of clusters for the PAM method is 3.

5 Clustering

In this section, the results of clustering with both the K-means and PAM methods will be shown.

Silhouette plots will also be drawn for both methods in order to choose the method that fits the data better.

5.1 K-means

km <- eclust(ToCluster, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean")

km.sil <- silhouette(km$cluster, dist(ToCluster))
fviz_silhouette(km.sil)
##   cluster size ave.sil.width
## 1       1  852          0.72
## 2       2  285          0.39
## 3       3   51          0.22

The average silhouette width for K-means is 0.6164138. The data points seem to be well matched to their own clusters.

5.2 PAM

pm <- eclust(ToCluster, FUNcluster = "pam", k = 3, hc_metric = "euclidean")

pm.sil <- silhouette(pm$cluster, dist(ToCluster))
fviz_silhouette(pm.sil)
##   cluster size ave.sil.width
## 1       1  809          0.69
## 2       2  282          0.43
## 3       3   97          0.13

The average silhouette width for PAM equals 0.5818517.

K-means seems to be the method that fits the data better, so the next steps will be performed for K-means only.

6 Post-diagnostics

6.1 Shadow statistics

d1 <- cclust(ToCluster, k = 3, dist = "euclidean")
shadow(d1)
##         1         2         3 
## 0.3196610 0.5768313 0.7002235
plot(shadow(d1))

The shadow values for all clusters seem to be safely distant from 1, so we can conclude that the data is clustered properly.

7 Results

Firstly, a stripes chart for all 3 groups will be drawn. A stripes chart shows the distance of every data point from the centroid of its cluster.

stripes(d1)

We can see in the graph that cluster 2 is the most inconsistent: the distances in this cluster are the largest (which is undesirable). This means that the customers in this cluster differ the most.

Now box plots of every variable for each group will be drawn.

ToCluster.c <- cbind(ToCluster, km$cluster)
colnames(ToCluster.c)[11] <- "Group"

df.m <- melt(ToCluster.c, id.var = "Group")
df.m$Group <- as.character(df.m$Group)

p <- ggplot(data = df.m, aes(x=variable, y=value)) +
  geom_boxplot(aes(fill = Group),outlier.size = 1) +
  facet_wrap( ~ variable, scales="free", ncol = 2) +
  xlab(label = NULL) + ylab(label = NULL) + ggtitle("Boxplots for 3 Groups of Clients") +
  guides(fill=guide_legend(title="Groups"))

p 

The analysis of box plots for each group can give very interesting insights.

Firstly, we can differentiate 3 groups of customers.

Group 1 - price-oriented customers

These are customers who buy on average the cheapest products. They are also very sensitive to delivery costs, as their average delivery cost is the lowest. They live in cities of widely varying sizes, with an average city population between 10,000 and 50,000.

These customers seem to be the least loyal group; they usually do not make another purchase.

Group 2 - passersby

These are customers who seem to have bought items ‘just passing by’. They buy rather more expensive items but do not come back. They might have bought the products as presents (which would explain the rather low average value of a single purchase) and do not feel the need to buy other similar items in the future.

On average they live in a city with a population between 10,000 and 50,000, but this varies a lot.

These customers do not seem to be a very loyal group either.

Group 3 - loyal customers

These are customers who make the most purchases. The average value of their single purchase is also the biggest. They are without doubt the most loyal customers.

Their first purchase was, on average, the longest ago compared to the other groups.

Interestingly, their average delivery cost is also the biggest. This might mean that these customers care less about delivery costs than about the quality of delivery. It might also be caused by the fact that they live in villages or small towns, where the cheaper type of delivery (Paczkomaty parcel lockers) is not available.

As mentioned above, the most loyal customers come on average from small towns and villages. Thus, their loyalty might be caused by the worse availability of shops in their neighbourhood.

Unsurprisingly, this group is barely interested in items on sale, so offering them products on sale might not be very effective. This group probably pays more attention to quality than to price.

7.1 Descriptive statistics by group

To draw a more detailed picture of the 3 groups of customers, descriptive statistics for each group will be calculated.

GroupsSummary <- describeBy(ToCluster.c[,1:10], ToCluster.c[,'Group'])

kable(GroupsSummary[[1]], format = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
vars n mean sd median trimmed mad min max range skew kurtosis se
Recency 1 852 9.93 4.76 11.00 10.31 4.45 0.0 18.00 18.00 -0.66 -0.55 0.16
FirstPurchase 2 852 10.11 4.73 12.00 10.53 4.45 0.0 18.00 18.00 -0.71 -0.45 0.16
SumTotal 3 852 49.02 15.13 47.19 48.11 10.38 16.5 139.57 123.07 1.21 4.20 0.52
MeanTotal 4 852 46.95 11.66 46.59 47.18 9.49 16.5 75.18 58.68 -0.16 0.34 0.40
MinTotal 5 852 46.77 11.76 46.59 47.00 9.49 16.5 75.18 58.68 -0.16 0.33 0.40
MaxTotal 6 852 47.13 11.71 47.04 47.35 9.56 16.5 75.18 58.68 -0.16 0.34 0.40
NumberOfPurchases 7 852 1.05 0.24 1.00 1.00 0.00 1.0 4.00 3.00 5.79 43.12 0.01
AvgDeliveryCost 8 852 10.03 3.57 8.60 9.50 2.08 0.0 20.00 20.00 1.21 1.12 0.12
CustomerCitySize 9 852 2.73 1.79 2.00 2.54 1.48 1.0 6.00 5.00 0.78 -0.87 0.06
SalePercentage 10 852 0.10 0.29 0.00 0.00 0.00 0.0 1.00 1.00 2.71 5.48 0.01
kable(GroupsSummary[[2]], format = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
vars n mean sd median trimmed mad min max range skew kurtosis se
Recency 1 285 9.99 4.96 11.00 10.39 4.45 0.00 18.00 18.00 -0.66 -0.76 0.29
FirstPurchase 2 285 10.66 4.83 12.00 11.18 4.45 0.00 18.00 18.00 -0.88 -0.36 0.29
SumTotal 3 285 113.09 32.53 103.97 108.70 29.34 75.58 240.15 164.57 1.19 1.14 1.93
MeanTotal 4 285 99.37 24.09 93.16 97.27 19.16 50.79 165.55 114.76 0.78 0.07 1.43
MinTotal 5 285 96.06 28.22 91.98 95.60 20.76 19.90 165.55 145.65 0.20 0.02 1.67
MaxTotal 6 285 102.77 23.98 97.17 100.28 21.02 58.98 187.36 128.38 0.86 0.12 1.42
NumberOfPurchases 7 285 1.19 0.46 1.00 1.07 0.00 1.00 3.00 2.00 2.44 5.35 0.03
AvgDeliveryCost 8 285 11.71 4.95 9.85 11.32 3.92 0.00 49.20 49.20 1.67 10.13 0.29
CustomerCitySize 9 285 2.67 1.81 2.00 2.46 1.48 1.00 6.00 5.00 0.80 -0.87 0.11
SalePercentage 10 285 0.10 0.22 0.00 0.04 0.00 0.00 1.00 1.00 2.39 5.11 0.01
kable(GroupsSummary[[3]], format = "html", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
vars n mean sd median trimmed mad min max range skew kurtosis se
Recency 1 51 9.55 5.19 11.00 9.76 4.45 0.00 18.00 18.00 -0.45 -1.05 0.73
FirstPurchase 2 51 11.41 4.50 13.00 11.93 2.97 1.00 18.00 17.00 -1.02 0.08 0.63
SumTotal 3 51 281.13 125.94 249.53 256.81 92.77 173.16 760.20 587.04 1.83 3.39 17.63
MeanTotal 4 51 197.23 66.77 183.95 189.30 39.30 80.22 467.89 387.66 1.77 4.69 9.35
MinTotal 5 51 181.75 82.85 182.56 177.80 54.84 34.98 467.89 432.91 0.80 1.85 11.60
MaxTotal 6 51 214.87 61.59 201.95 206.85 37.06 105.97 467.89 361.92 1.85 4.88 8.62
NumberOfPurchases 7 51 1.61 1.10 1.00 1.34 0.00 1.00 6.00 5.00 2.23 4.89 0.15
AvgDeliveryCost 8 51 13.15 5.28 15.00 13.27 7.41 0.00 20.00 20.00 -0.09 -1.11 0.74
CustomerCitySize 9 51 2.18 1.44 2.00 1.93 1.48 1.00 6.00 5.00 1.26 0.47 0.20
SalePercentage 10 51 0.05 0.10 0.00 0.02 0.00 0.00 0.43 0.43 2.10 3.60 0.01

These statistics confirm the previous conclusions from the box plot analysis. Customers from Group 3 are the most loyal ones, with an average of 1.61 purchases. More importantly, their average value of a single purchase is 98% and 320% bigger than in Group 2 and Group 1 respectively. Additionally, because they make more purchases, the average total value of all their purchases is 149% and 474% bigger than in Group 2 and Group 1 respectively.

The loyal customers are just 4.29% of all customers, but their purchases account for 16.23% of total revenue. The price-oriented customers account for 47.28% of total revenue; as the biggest group, they are also the most valuable one. The passersby account for 36.49% of total revenue.
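
These shares follow directly from the group sizes and the mean SumTotal values reported in the tables above; a quick check in R:

n_g   <- c(852, 285, 51)                 # group sizes (Groups 1-3)
sum_g <- c(49.02, 113.09, 281.13)        # mean SumTotal per group
revenue <- n_g * sum_g
round(100 * revenue / sum(revenue), 2)   # 47.28 36.49 16.23 (% of total revenue)
round(100 * n_g / sum(n_g), 2)           # 71.72 23.99 4.29 (% of all customers)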

8 Conclusions

K-means clustering allowed the customers to be divided into 3 groups: price-oriented customers, passersby and loyal customers.

The analysis resulted in important conclusions about each group’s distinctive characteristics. It has been shown that the most valuable group is the price-oriented group, which chooses the cheapest products and the cheapest forms of delivery. The most loyal group, on the contrary, chooses rather more expensive types of delivery and does not seem to care that much about price in general.

It is worth noting that every new client can now be assigned to one of the 3 groups, and current clients can be reassigned to other groups as they make new purchases. This means that the company can address each client in a more appropriate way.
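
As a closing sketch, assigning a new client to the nearest K-means centroid could look as follows; new_client is a hypothetical vector prepared with exactly the same variables and scaling as ToCluster.

# assign a new observation to the cluster with the nearest centroid
assign_group <- function(new_client, centers) {
  which.min(colSums((t(centers) - new_client)^2))
}
assign_group(new_client, km$centers)   # km is the K-means fit from Section 5.1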