1 Introduction

For the past years, there has been a strong and steady increase of online retail sales. According to report made by Interaktywnie.com, the growth of Polish online retail sales might reach up to 23% in 2018. It comes as no surprise that nowadays online customers are crucial to many shops, even stationary ones, as they often sell more online than face to face. Therefore, it is very important to understand the distinct characteristics of online clients.

The fact that in online shopping each customer has a unique ID (either Login or e-mail address), known to the seller, creates great opportunities. History of purchases can be used for dividing customers into groups based on their common characteristics, which is called Customer Segmentation. It allows not only get the better picture of customers and to understand them better, but also to reach out to them in a more effective and approprate way. In other words, it allows to treat each customers as an individual.

The online seller analysed in this paper is a small business that sells made in Poland children’s clothing and has 2 stationary points, suited for wholesale. The company also has been selling online at a Polish biggest e-commerce platform, Allegro, for the past 4 years. The variety of products at the online shop is very limited compared to the stationary shop.

In this article, I will try to use data mining techniques such as K-means and PAM to answer the following business concerns:

2 Methodology

2.1 Methods for validating clustering tendency

Clustering tendency assessment allows to check whether the data is clusterable. It is an important step in cluster analysis. In this section, the methods described will be Hopkins statistic and Visual Assessment of cluster Tendency.

2.1.1 Hopkins statistic

Hopkins statistic tests whether the given dataset is generated by a uniform data distribution. The formula for Hopkins statistic is defined as follows:

\[H = \frac{\overline{x_{dist}}}{\overline{x_{dist}}+\overline{y_{dist}}}\] where \(\overline{x_{dist}}\) is average distance to nearest neighbour between real data and \(\overline{y_{dist}}\) is average distance to nearest neighbour between real point.

The null hypothesis of the test is that the dataset is uniformly distributed (hence no significant clusters). The alternative hypothesis states that the dataset is not uniformly distributed.

The rule of the thumb is that if the Hopkins statistic is above 0.5 than the dataset is not clusterable.

If the value of H is close to zero, then we can reject null hypothesis and deduce that dataset contains meaningful clusters.

2.1.2 VAT: Visual Assessment of cluster Tendency

VAT is a visual approach for assessing cluster tendency. It provides a square digital image which shows pair-wise dissimilarity information about the data points. Objects are reordered in such a way that highlights potenial clusterings.

The algorithm of VAT contains 3 steps:

  1. Compute the dissimilarity matrix (DM) between the objects in the dataset using Euclidean distance or other,

  2. Reorder the DM in such way that similar objects are close to one another, which results in an ordered dissimilarity matrix (ODM),

  3. Display ODM as an ordered dissimilarity image (ODI).

ODI, which is visual output of VAT, visualises cluster tendency of dataset. If there are visible blocks along the diagonal in a VAT image, we can conclude that data is clusterable.

2.2 Clustering

K-means Clustering and PAM (Partitioning Around Medodoids) algorithm are types of unsupervised learning. This means that the dataset has not beem labeled, classified or categorised. The aim of these algorithms is to divide data into groups (clusters), with K number of groups.

In K-means and PAM clustering, each object belongs to one and only one cluster (which is known in literature as hard clustering). In a soft clustering method, by contrast, a single object can belong to many clusters, often with a confidence.

2.2.1 K-means

As said before, K-means aims to divide data into K groups (clusters). This involves finding K centroids. Centroids are invented or existing points that represent the centers of the clusters. In K-means clustering the aim is to indetify K number of centroids and allocate every object to the nearest cluster.

The algorithm for K-means is as follows:

  1. Place K points into the object data space. These points represent the initial group of centroids,

  2. Assign each data point to the closest centroid, based on Euclidean distance or other,

  3. Recalculate the positions of the K centroids. This is done by taking the mean of all data points assigned to that centroid’s cluster,

  4. Repeat steps 2 and 3 until the posistions of the centroids no longer change and the sum of distances of individual units from centroids is as small as possible.

2.2.2 PAM

The main difference between K-means and PAM method is that K-means uses centroids (usually artifficial points), while PAM uses medodoids, which are always the actual points in the dataset.

The algorith for PAM is similar to K-mean’s algorithm:

  1. Select random K objects for an initial set of medodoids,

  2. Assign each data point to the closest centroid, based on Euclidean distance or other,

  3. Try to improve the quality of clustering by exchanging selected objects with unselected objects,

  4. Repeat steps 2 and 3 until the average distance of objects from medodoids is minimised.

There is an extension to PAM method - CLARA (Clustering Large Applications). CLARA applies PAM algorithm not to the entire data set, but to a small sample of the data. It repeats the procedure of sampling and applying a pre-specified number of times so that the sampling bias is minimised. CLARA algorithm allows to reduce computing time and RAM memory problem which might happen while analysing a large dataset. This method is dedicated for data containing a large number of objects and thus will not be used in this article.

2.2.3 Choosing optimal K number

The algorithms described above divide dataset into clusters for a chosen K number of clusters. There are many statistic useful for choosing optimal K, but in general there is no method for determining exact value of K. Silhouette

Silhouette provides a visualisation of how well each object lies within its cluster.

A silhouette is defined as follows:

\[s = \frac{b(i)-a(i)}{max\{a(i), b(i)\}}\] where \(a(i)\) is the average distance to all other data points in the cluster and b(i) is the minimum of average distance to other clusters.

Silhouette measure has a range of [-1, 1]. In general, positive silhouette values are preferable as they indicate that the sample is far away from the neighbouring clusters. Average Within-Cluster Distance to Centroid

Other metric that is usually used to compare different number of clusters K is the average distance between data points and their cluster centroid. Increasing number of clusters will always decrease this statistics, so the goal is to find a point where the rate of decrease sharply shifts (“elbow point”). Shadow statistics

The shadow of an object is very similar to its silhouette. Shadow value is defined as

\[s(i) = \frac{2a(i)}{a(i) + b(i)}\]

where \(a(i)\) is the distance of i data point to the centroid closest to i and \(b(i)\) is the distance of i data point to the second-closest centroid to i.

The desired shadow value is close to 0 as this means that the data point is close to its cluster centroid. Value close to 1 means that the data point is almost equal to 2 centroids. Other statistics

There are many more metrics for assigning the optimal number of clusters. In R language, the package NbClust provides 30 indices for determining the optimal number of clusters. In this article, 26 of them will be calculated. Two of them - Hubert index and D index are graphical methods. The optimal number of clusters is in their “elbow point” (where the rate of statistics decreases shifts sharply).

3 Data description

The data used has been collected from 18.06.2017 until 11.12.2018 (1.5 years). All the clients analysed are individual customers from Poland.

Allegro’s sales management system allows to download transactions’ details concerning date of purchase, customer’s address, customer’s name and e-mail address, delivery info and product’s name, price and quantity sold. Therefore, each client has been identified by their own unique e-mail address. In order to use information about customers’ Cities size, the dataset has been merged with TERYT registry of Polish Central Statistical Office.

After tedious work with pre-processing the data, eleven variables have been chosen. Variables that have been used in the analysis are presented in table below.

VariableName Description
CustomerEmail E-mail address, unique for each client
Recency Time in month since the last purchase
FirstPurchase Time in month since the first purchase
SumTotal Sum of total value of all purchases per client
MeanTotal Average spending per client
MinTotal Minimum spending per client
MaxTotal Maximum spending per client
NumberOfPurchases Number of purchases throughout the analysed time
AvgDeliveryCost Average delivery cost
CustomerCitySize Customer City Size, where 1. Village or city with less than 10,000 population 2. 3. City with population between 50,000 and 100,000 4. City with population between 100,000 and 200,000 5. City with population between 200,000 and 500,000 6. City with population over 500,000
SalePercentage Percentage of on sale items in all purchases

The table below presents summary for the analysed variables.

kable(summarize(Customers2[,2:11], type = 'numeric'),format = 'html') %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
N Mean SD Min Q1 Median Q3 Max
Recency 1188 9.93 4.82 0.0 7.00 11.00 13.00 18.00
FirstPurchase 1188 10.29 4.75 0.0 8.00 12.00 14.00 18.00
SumTotal 1188 74.35 61.18 16.5 43.98 53.99 85.68 760.20
MeanTotal 1188 65.97 41.13 16.5 43.59 52.19 78.32 467.89
MinTotal 1188 64.39 40.43 16.5 43.19 50.49 74.58 467.89
MaxTotal 1188 67.68 43.88 16.5 43.59 52.49 80.98 467.89
NumberOfPurchases 1188 1.11 0.39 1.0 1.00 1.00 1.00 6.00
AvgDeliveryCost 1188 10.57 4.12 0.0 8.00 8.60 13.41 49.20
CustomerCitySize 1188 2.69 1.79 1.0 1.00 2.00 4.00 6.00
SalePercentage 1188 0.10 0.27 0.0 0.00 0.00 0.00 1.00

4 Selection of clustering method and number of clusters

4.1 Prediagnostics

Firstly, the clustering tendency of the data will be analysed, using Hopkins statistics.

h <- clustertend::hopkins(data = ToCluster, n = 50)
## $H
## [1] 0.02408151

Hopkins statistics for the data equals 0.0240815, which allows to reject the null hypothesis that data is uniformly distributed. From that, we can conclude that the data is highly clustered.

I will also plot The Visual Assessment of cluster tendency (VAT) to visually determine if there are any potenial clusterings

factoextra::get_clust_tendency(data = ToCluster, n = 5)
## $hopkins_stat
## [1] 0.0104772
## $plot