Introduction

In general, clustering is the problem of finding groups among the elements of a dataset. The main objective of clustering algorithms is to study the structure of the data in order to group objects such that those in the same group are more similar to each other than to objects in other groups. Clustering is an unsupervised machine learning task, which makes it particularly relevant for real-world problems, where data rarely comes with predefined labels. Customer segmentation is one such problem: its goal is to group customers with similar tastes. This, in turn, can be utilized by sellers to achieve more efficient marketing by presenting more personalized offers. In this paper, we will perform customer segmentation using the following clustering techniques:

  • K-means
  • PAM
  • CLARA

Libraries

library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(ClusterR)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Dataset

The dataset used in this paper consists of 200 observations and six variables: a customer ID, gender, age, annual income (in thousands), spending score, and a cluster label (this last variable will not be used).

#reading the dataset
customers<-read.csv("customers.csv", sep=",", dec=".", header=TRUE) 
head(customers)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100. cluster
## 1          1   Male  19                 15                     39       4
## 2          2   Male  21                 15                     81       3
## 3          3 Female  20                 16                      6       4
## 4          4 Female  23                 16                     77       3
## 5          5 Female  31                 17                     40       4
## 6          6 Female  22                 17                     76       3
#getting into statistics details of the dataset
summary(customers)
##    CustomerID        Gender               Age        Annual.Income..k..
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50    
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.    cluster     
##  Min.   : 1.00          Min.   :0.000  
##  1st Qu.:34.75          1st Qu.:1.000  
##  Median :50.00          Median :2.000  
##  Mean   :50.20          Mean   :2.245  
##  3rd Qu.:73.00          3rd Qu.:4.000  
##  Max.   :99.00          Max.   :5.000
#printing the dimensions of the dataset
dim(customers)
## [1] 200   6
#checking whether there are any NA values
sum(is.na(customers))
## [1] 0

Based on the summary, we can see that this dataset consists of customers aged between 18 and 70, and the average age of a customer in the dataset is 38.85.

Data exploration - plots

plot(customers$Age, customers$Spending.Score..1.100., main = "Customer spending score by age", xlab="Age", ylab="Spending score", col="pink")

#keeping only the Age and Spending Score columns for clustering
d<-data.frame(customers$Age, customers$Spending.Score..1.100.)

From the plot, we can see that customers aged below 40 have the highest spending scores.

Histograms

ggplot(customers,aes(x=Age)) +
  geom_histogram(col="pink")+
  labs(title="Distribution of the Age")+  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(customers,aes(x=Spending.Score..1.100.)) +
  geom_histogram(col="#660033")+
  labs(title="Distribution of the spending score")+xlab("Spending score")+  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density plots

#creating a density plot with the mean depicted as a dashed line
ggplot(customers, aes(x = Age)) +
  geom_vline(aes(xintercept = mean(Age)), color = "#660033",
             linetype = "dashed", linewidth = 1.0) +
  geom_density(alpha = 0.4, fill = "pink") +
  labs(title = "Density Distribution of Age Groups") + theme(plot.title = element_text(hjust = 0.5))

mean(customers$Age)
## [1] 38.85

In this plot, we can see the continuous probability distribution of the Age variable, with a mean of 38.85.

ggplot(customers, aes(x = Spending.Score..1.100.)) +
  geom_density(alpha=0.4, fill="#660033") +
  scale_x_continuous(breaks = seq(0, 200, by = 10)) + xlab("Spending score")+
  labs(title="Density Distribution of Spending Score")+theme(plot.title = element_text(hjust = 0.5))

Barplot

col <- c("pink", "#660033")
ggplot(customers, aes(x = Gender, fill = Gender)) +
  geom_bar(stat = "count", width = 0.5) +
  scale_fill_manual(values = col) +
  geom_text(stat = "count", aes(label = after_stat(count)/200), vjust = 1.6, color = "white", size = 3.5) +
  theme_minimal() +
  labs(title = "Gender Comparison", x = "Gender") +
  theme(plot.title = element_text(hjust = 0.5))

From the above plot, we can see that the customers dataset consists of 56% female and 44% male customers.

K-means

K-means is an unsupervised learning algorithm used to partition unlabelled data into a predefined number of “K” separate groups (clusters), where each element is assigned to exactly one cluster and each cluster is represented by a centroid. The K-means algorithm works as follows:

First, the number of clusters K is chosen and K centroids are initialized by randomly selecting data points. The next two steps are then repeated until the algorithm converges, meaning that the assignments of points no longer change.

The first step is the calculation of the distances between each data point and each cluster center using the Euclidean metric, followed by the assignment of each data point to the closest cluster. In the second step, the new centroids are calculated as the means of the points belonging to each cluster.

One of the main drawbacks of K-means is its sensitivity to outliers. Extreme values can highly affect the mean measure, which is used to calculate the centroids.
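
To make these two steps concrete, below is a minimal from-scratch sketch of the iteration described above, run on the Age/Spending Score data frame d. It is illustrative only (empty clusters are not handled, and the actual analysis relies on the kmeans-based eclust() call further down); the variable names are our own.

#minimal K-means (Lloyd's algorithm) sketch on the Age / Spending Score data
set.seed(123)
X <- as.matrix(d)                      #n x 2 numeric matrix
K <- 4
centroids <- X[sample(nrow(X), K), ]   #initialization: K randomly selected data points

for (iter in 1:100) {
  #step 1: assign each point to the nearest centroid (Euclidean distance)
  dists <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
  cl <- max.col(-dists)
  #step 2: recompute each centroid as the mean of the points assigned to it
  new_centroids <- t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
  if (all(abs(new_centroids - centroids) < 1e-8)) break  #assignments stable: converged
  centroids <- new_centroids
}
table(cl)  #cluster sizes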

Finding the optimal number of clusters for K-means

fviz_nbclust(d, kmeans) 
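
By default, fviz_nbclust() uses the average silhouette width criterion; the elbow (total within-cluster sum of squares) criterion can be used as a cross-check, for example:

#elbow method: look for the bend in the total within-cluster sum of squares curve
fviz_nbclust(d, kmeans, method = "wss")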

Clustering of the dataset

kresult<-eclust(d,"kmeans", hc_metric="euclidean",k=4)

kresult
## K-means clustering with 4 clusters of sizes 48, 48, 57, 47
## 
## Cluster means:
##   customers.Age customers.Spending.Score..1.100.
## 1      43.29167                         15.02083
## 2      55.70833                         48.22917
## 3      30.17544                         82.35088
## 4      27.61702                         49.14894
## 
## Clustering vector:
##   [1] 4 3 1 3 4 3 1 3 1 3 1 3 1 3 1 3 4 4 1 3 4 3 1 3 1 3 1 4 1 3 1 3 1 3 1 3 1
##  [38] 3 1 3 2 3 2 4 1 4 2 4 4 4 2 4 4 2 2 2 2 2 4 2 2 4 2 2 2 4 2 2 4 4 2 2 2 2
##  [75] 2 4 2 4 4 2 2 4 2 2 4 2 2 4 4 2 2 4 2 4 4 4 2 4 2 4 4 2 2 4 2 4 2 2 2 2 2
## [112] 4 4 4 4 4 2 2 2 2 4 4 4 3 4 3 2 3 1 3 1 3 4 3 1 3 1 3 1 3 1 3 4 3 1 3 2 3
## [149] 1 3 1 3 1 3 1 3 1 3 1 3 2 3 1 3 1 3 1 3 1 4 1 3 1 3 1 3 1 3 1 3 1 3 1 3 4
## [186] 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 10102.896  5694.396  6165.228  6203.064
##  (between_SS / total_SS =  83.6 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"
fviz_cluster(kresult, main="kmeans / Euclidean",xlab="Age", ylab="Spending score")

Stripes plot

This plot shows the distances of the data points to cluster centroids.

#implementation of the plot using Euclidean distance metric
euc<-cclust(d, 4, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
stripes(euc) 

# implementation of the stripes plot using the Manhattan distance metric
man<-cclust(d, 4, dist="manhattan")
stripes(man) 

Partitioning Around Medoids (PAM)

PAM, also known as k-medoids, is a method that is more robust to noise and outliers than k-means. In contrast to k-means, which minimizes the within-cluster variances, k-medoids minimizes the dissimilarities between the data points within a cluster and the center of that cluster. What is more, k-medoids can be used with an arbitrary distance measure. A medoid is a data point chosen as a cluster center. The algorithm consists of the following steps:

1. Build: select k medoids using a greedy algorithm and assign every object to the closest one.
2. Swap: for every non-medoid point, calculate the cost (using the chosen distance metric) of making it a medoid, and choose the swap with the smallest cost.

A huge advantage of this method is that the build phase already arrives at a fairly good clustering, since the initial medoids are not random points; therefore the subsequent swap phase is performed only a few times before the algorithm converges.
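
As an illustration of the swap idea (a sketch only; the actual clustering below uses cluster::pam() via eclust()), the cost of a medoid set can be written as the sum of distances from each point to its nearest medoid, and a single swap pass then tries to exchange each medoid for each non-medoid and keeps the cheapest configuration. Here the initial medoids are random points rather than the result of the greedy build phase:

#cost of a medoid set = sum of distances from each point to its nearest medoid
D <- as.matrix(dist(d, method = "euclidean"))
cost <- function(medoid_ids) sum(apply(D[, medoid_ids, drop = FALSE], 1, min))

medoids <- sample(nrow(d), 4)   #stand-in for the result of the build phase
cost(medoids)

#swap pass: exchange one medoid for one non-medoid if it lowers the cost
best <- medoids
for (m in medoids) {
  for (h in setdiff(seq_len(nrow(d)), medoids)) {
    candidate <- c(setdiff(medoids, m), h)
    if (cost(candidate) < cost(best)) best <- candidate
  }
}
cost(best)  #cost after one swap pass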

Finding the optimal number of clusters for PAM

fviz_nbclust(d, pam) 

Clustering implementation

Visualization
cust<-eclust(d, "pam", k=4) 

Elliptic Visualization
fviz_cluster(cust, geom="point", ellipse.type="norm") 

head(summary(cust))
## $medoids
##      customers.Age customers.Spending.Score..1.100.
## [1,]            27                               50
## [2,]            30                               83
## [3,]            43                               17
## [4,]            53                               46
## 
## $id.med
## [1]  98 200 151  60
## 
## $clustering
##   [1] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 1 1 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
##  [38] 2 3 2 4 2 4 1 3 1 4 1 1 1 4 1 1 4 4 4 4 4 1 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4
##  [75] 4 1 4 4 1 4 4 1 4 4 1 4 4 1 1 4 4 1 4 4 1 1 4 1 4 1 1 4 4 1 4 1 4 4 4 4 4
## [112] 1 1 1 1 1 4 4 4 4 1 1 1 2 1 2 4 2 3 2 3 2 1 2 3 2 3 2 3 2 3 2 1 2 3 2 4 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 4 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 2 3 2 3 2 4
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## 
## $objective
##    build     swap 
## 11.70431 10.80452 
## 
## $isolation
##  1  2  3  4 
## no no no no 
## Levels: no L L*
## 
## $clusinfo
##      size max_diss  av_diss diameter separation
## [1,]   44 21.37756 10.18784 37.12142   2.000000
## [2,]   57 16.76305  9.71942 31.78050   5.000000
## [3,]   48 26.83282 12.93507 48.83646   3.605551
## [4,]   51 19.72308 10.54409 34.20526   2.000000

Validating how well each observation is clustered

fviz_silhouette(cust)
##   cluster size ave.sil.width
## 1       1   44          0.45
## 2       2   57          0.59
## 3       3   48          0.45
## 4       4   51          0.49

Objects with Si (silhouette) close to 1 are adequately clustered, while observations with a small Si, around 0, lie between two clusters. Samples with negative Si are seemingly assigned to the wrong cluster.
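
The silhouette values visualized above can also be computed directly with silhouette() from the cluster package, which returns one Si per observation; a minimal example:

#silhouette widths for the PAM solution (one value per observation)
sil <- silhouette(cust$clustering, dist(d))
head(sil[, "sil_width"])   #individual Si values
mean(sil[, "sil_width"])   #overall average silhouette width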

Clustering using Manhattan distance metric

cust2<-eclust(d, "pam", k=4, hc_metric="manhattan")

head(summary(cust2))
## $medoids
##      customers.Age customers.Spending.Score..1.100.
## [1,]            27                               50
## [2,]            30                               83
## [3,]            43                               17
## [4,]            53                               46
## 
## $id.med
## [1]  98 200 151  60
## 
## $clustering
##   [1] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 1 1 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
##  [38] 2 3 2 4 2 4 1 3 1 4 1 1 1 4 1 1 4 4 4 4 4 1 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4
##  [75] 4 1 4 4 1 4 4 1 4 4 1 4 4 1 1 4 4 1 4 4 1 1 4 1 4 1 1 4 4 1 4 1 4 4 4 4 4
## [112] 1 1 1 1 1 4 4 4 4 1 1 1 2 1 2 4 2 3 2 3 2 1 2 3 2 3 2 3 2 3 2 1 2 3 2 4 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 4 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 2 3 2 3 2 4
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## 
## $objective
##    build     swap 
## 11.70431 10.80452 
## 
## $isolation
##  1  2  3  4 
## no no no no 
## Levels: no L L*
## 
## $clusinfo
##      size max_diss  av_diss diameter separation
## [1,]   44 21.37756 10.18784 37.12142   2.000000
## [2,]   57 16.76305  9.71942 31.78050   5.000000
## [3,]   48 26.83282 12.93507 48.83646   3.605551
## [4,]   51 19.72308 10.54409 34.20526   2.000000

Validating how well each observation is clustered

fviz_silhouette(cust2)
##   cluster size ave.sil.width
## 1       1   44          0.45
## 2       2   57          0.59
## 3       3   48          0.45
## 4       4   51          0.49

attributes(cust2)
## $names
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## [11] "clust_plot" "nbclust"   
## 
## $class
## [1] "pam"       "partition" "eclust"

As the output shows, the result is identical to the Euclidean run. This is because, in eclust(), the hc_metric argument applies to hierarchical clustering methods and is not forwarded to pam(); the clustering above is therefore still based on Euclidean distance. To actually cluster with the Manhattan metric, the metric argument has to be passed to pam() itself.
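
For completeness, a sketch of how the Manhattan metric could be applied directly (output not shown here):

#PAM with the Manhattan distance passed to pam() itself
cust2.man <- pam(d, k = 4, metric = "manhattan")
fviz_silhouette(cust2.man)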

Cluster medoids

#running PAM directly from the cluster package, this time with k = 2
pam.res <- pam(d, 2)
print(pam.res)
## Medoids:
##       ID customers.Age customers.Spending.Score..1.100.
## [1,] 127            43                               35
## [2,] 158            30                               78
## Clustering vector:
##   [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
##  [38] 2 1 2 1 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1
##  [75] 1 2 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## [149] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [186] 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## Objective function:
##    build     swap 
## 18.99659 17.51984 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
pam.res$medoids
##      customers.Age customers.Spending.Score..1.100.
## [1,]            43                               35
## [2,]            30                               78

Cluster numbers

head(pam.res$clustering)
## [1] 1 2 1 2 1 2

Visualization of PAM clusters

 fviz_cluster(pam.res)

CLARA (CLustering Large Applications)

Clara is an extension of the k-medoids method and is designed for clustering large datasets, even those with thousands of objects. The algorithm achieves this by dividing the dataset into subsets and performing PAM to generate an optimal set of medoids on a smaller sample. The algorithm works as follows:

1. Divide the original dataset into several subsets.
2. Apply PAM on each subset.
3. Measure the goodness of the clustering by calculating the dissimilarities of the objects to their closest medoids.
4. Repeat the sampling and clustering process a pre-defined number of times and select the set of medoids with the minimal cost.
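
To illustrate the sampling idea (a rough sketch only; the analysis itself uses clara() from the cluster package), PAM can be run on random subsets and the medoid set with the lowest average distance on the full dataset kept:

#minimal sketch of the CLARA idea: PAM on random subsets, keep the cheapest medoid set
set.seed(123)
k <- 4; n_samples <- 5
sample_size <- 40 + 2 * k   #roughly clara's default sample size
best_cost <- Inf; best_medoids <- NULL

for (s in 1:n_samples) {
  idx <- sample(nrow(d), sample_size)
  fit <- pam(d[idx, ], k)   #PAM on the subset
  #goodness of the medoids measured on the FULL dataset
  D <- as.matrix(dist(rbind(fit$medoids, as.matrix(d))))[-(1:k), 1:k]
  cost <- mean(apply(D, 1, min))
  if (cost < best_cost) { best_cost <- cost; best_medoids <- fit$medoids }
}
best_medoids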

Finding the optimal number of clusters for CLARA

fviz_nbclust(d, clara) 

Clara implementation

The output of the CLARA method contains, among others, the following elements:

  • medoids: objects that represent the clusters
  • clustering: a vector containing the cluster number of each object
  • sample: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition

clara.res<-clara(d, 4, metric = "euclidean", stand = FALSE, 
      samples = 5, pamLike = FALSE)
clara.res
## Call:     clara(x = d, k = 4, metric = "euclidean", stand = FALSE, samples = 5, pamLike = FALSE) 
## Medoids:
##      customers.Age customers.Spending.Score..1.100.
## [1,]            24                               52
## [2,]            30                               86
## [3,]            40                               13
## [4,]            50                               45
## Objective function:   11.09193
## Clustering vector:    int [1:200] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 4 1 ...
## Cluster sizes:            41 55 43 61 
## Best sample:
##  [1]   3   4   8  12  14  19  23  27  33  41  43  44  53  55  56  58  62  63  70
## [20]  79  87  96 107 108 115 120 121 129 130 135 138 139 145 149 153 154 159 165
## [39] 167 171 173 174 176 179 181 183 185 199
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"

Validating how well each observation is clustered

fviz_silhouette(clara.res)
##   cluster size ave.sil.width
## 1       1    8          0.68
## 2       2    9          0.50
## 3       3   17          0.45
## 4       4   14          0.38
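
Note that for a CLARA object the silhouette information refers to the best sample only, which is why the cluster sizes above sum to 48 (the size of the best sample) rather than to all 200 observations.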

Cluster medoids

clara.res$medoids
##      customers.Age customers.Spending.Score..1.100.
## [1,]            24                               52
## [2,]            30                               86
## [3,]            40                               13
## [4,]            50                               45

Cluster numbers

head(clara.res$clustering, 10)
##  [1] 1 2 3 2 1 2 3 2 3 2

Visualizing CLARA results

fviz_cluster(clara.res) 

Summary

In this paper, we implemented three clustering methods, K-means, K-medoids (PAM), and CLARA, to carry out customer segmentation. All of these algorithms identify groups by assigning data points to clusters based on a measure of similarity or distance between the data points and a representative point. In K-means, this representative point is the mean of the data points in the cluster, while in K-medoids and CLARA it is a medoid, the data point that is most centrally located within the cluster. Additionally, all three algorithms require a priori knowledge of the number of clusters, and determining the optimal number of clusters can be difficult in practice. All three are also iterative: they repeatedly update the assignment of data points to clusters and the location of the representative points (centroids or medoids) until the solution stabilizes.