In general, clustering is the problem of finding groups among the elements of a dataset. The main objective of clustering algorithms is to study the structure of the data and group objects so that those in the same group are more similar to each other than to objects in other groups. Clustering is an unsupervised machine learning task, which makes it particularly relevant for real-world problems, where data usually does not come with predefined labels. Customer segmentation is one such problem: the goal is to group customers with similar tastes, which sellers can then use for more efficient marketing by presenting more personalized offers. In this paper, we perform customer segmentation using clustering techniques such as K-means, PAM (k-medoids), and CLARA. We begin by loading the required R packages.
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(ClusterR)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The dataset used in this paper consists of 200 observations and 6 variables: customer ID, gender, age, annual income, spending score, and a cluster label (the last variable will not be used).
#reading the dataset
customers<-read.csv("customers.csv", sep=",", dec=".", header=TRUE)
head(customers)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100. cluster
## 1 1 Male 19 15 39 4
## 2 2 Male 21 15 81 3
## 3 3 Female 20 16 6 4
## 4 4 Female 23 16 77 3
## 5 5 Female 31 17 40 4
## 6 6 Female 22 17 76 3
#getting into statistics details of the dataset
summary(customers)
## CustomerID Gender Age Annual.Income..k..
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100. cluster
## Min. : 1.00 Min. :0.000
## 1st Qu.:34.75 1st Qu.:1.000
## Median :50.00 Median :2.000
## Mean :50.20 Mean :2.245
## 3rd Qu.:73.00 3rd Qu.:4.000
## Max. :99.00 Max. :5.000
#printing the dimensions of the dataset
dim(customers)
## [1] 200 6
#checking whether there are any NA values
sum(is.na(customers))
## [1] 0
Based on the summary, we can see that the dataset contains customers aged between 18 and 70, with an average age of 38.85 years.
plot(customers$Age, customers$Spending.Score..1.100., main = "Customers spending score depending on the age", xlab="Age", ylab="Spending score", col="pink")
d<-data.frame(customers$Age, customers$Spending.Score..1.100.)
From the plot, we can see that customers aged below 40 have the highest spending scores.
ggplot(customers,aes(x=Age)) +
geom_histogram(col="pink")+
labs(title="Distribution of the Age")+ theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(customers,aes(x=Spending.Score..1.100.)) +
geom_histogram(col="#660033")+
labs(title="Distribution of the spending score")+xlab("Spending score")+ theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#### Density plots
#creating a density plot with a mean depicted as dashed line
ggplot(customers, aes(x = Age)) +
geom_vline(aes(xintercept = mean(Age)), color = "#660033",
linetype = "dashed", linewidth = 1.0) +
geom_density(alpha = 0.4, fill = "pink") +
labs(title = "Density Distribution of Age Groups") + theme(plot.title = element_text(hjust = 0.5))
mean(customers$Age)
## [1] 38.85
In this plot we can see the continuous probability distribution of the age variable, with a mean of 38.85.
ggplot(customers, aes(x = Spending.Score..1.100.)) +
geom_density(alpha=0.4, fill="#660033") +
scale_x_continuous(breaks = seq(0, 200, by = 10)) + xlab("Spending score")+
labs(title="Density Distribution of Spending Score")+theme(plot.title = element_text(hjust = 0.5))
col <- c("pink", "#660033")
ggplot(customers, aes(x = Gender, fill = Gender)) +
geom_bar(stat = "count", width = 0.5) +
scale_fill_manual(values = col) +
geom_text(stat = "count", aes(label = after_stat(count) / 200), vjust = 1.6, color = "white", size = 3.5) +
theme_minimal() +
labs(title = "Gender Comparison", x = "Gender") +
theme(plot.title = element_text(hjust = 0.5))
From the above plot we can see that the customers dataset consists of 56% female and 44% male customers.
K-means is an unsupervised learning algorithm used to partition unlabelled data into a predefined number K of separate groups (clusters), where each element is assigned to exactly one cluster. Each cluster is represented by a centroid. The K-means algorithm works as follows:
At the beginning, the number of groups K is chosen. Once K is defined, K-means initializes K centroids by randomly selecting data points. The following two steps are then repeated until the algorithm converges, that is, until the assignments of points no longer change.
In the first step, the distance between each data point and each cluster center is calculated using the Euclidean metric, and each point is assigned to the closest cluster. In the second step, the new centroids are computed as the means of the points belonging to each cluster. A minimal sketch of this loop is given below.
One of the main drawbacks of K-means is its sensitivity to outliers: extreme values can strongly affect the mean used to compute the centroids.
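The loop described above can be written out in a few lines of R. The following is a minimal sketch for illustration only (it uses random initialization and assumes no cluster becomes empty); the actual analysis below relies on stats::kmeans via factoextra::eclust.
#minimal illustrative K-means loop (not used for the analysis below)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   #initialize centroids with k random data points
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    #step 1: assign every point to its closest centroid (squared Euclidean distance)
    dists <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    new_assignment <- apply(dists, 1, which.min)
    if (all(new_assignment == assignment)) break     #converged: assignments no longer change
    assignment <- new_assignment
    #step 2: recompute each centroid as the mean of the points assigned to it
    centers <- t(sapply(seq_len(k), function(j) colMeans(x[assignment == j, , drop = FALSE])))
  }
  list(cluster = assignment, centers = centers)
}
#e.g. simple_kmeans(d, 4) on the Age / Spending score data frame d built earlier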
fviz_nbclust(d, kmeans)
kresult<-eclust(d,"kmeans", hc_metric="euclidean",k=4)
kresult
## K-means clustering with 4 clusters of sizes 48, 48, 57, 47
##
## Cluster means:
## customers.Age customers.Spending.Score..1.100.
## 1 43.29167 15.02083
## 2 55.70833 48.22917
## 3 30.17544 82.35088
## 4 27.61702 49.14894
##
## Clustering vector:
## [1] 4 3 1 3 4 3 1 3 1 3 1 3 1 3 1 3 4 4 1 3 4 3 1 3 1 3 1 4 1 3 1 3 1 3 1 3 1
## [38] 3 1 3 2 3 2 4 1 4 2 4 4 4 2 4 4 2 2 2 2 2 4 2 2 4 2 2 2 4 2 2 4 4 2 2 2 2
## [75] 2 4 2 4 4 2 2 4 2 2 4 2 2 4 4 2 2 4 2 4 4 4 2 4 2 4 4 2 2 4 2 4 2 2 2 2 2
## [112] 4 4 4 4 4 2 2 2 2 4 4 4 3 4 3 2 3 1 3 1 3 4 3 1 3 1 3 1 3 1 3 4 3 1 3 2 3
## [149] 1 3 1 3 1 3 1 3 1 3 1 3 2 3 1 3 1 3 1 3 1 4 1 3 1 3 1 3 1 3 1 3 1 3 1 3 4
## [186] 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 10102.896 5694.396 6165.228 6203.064
## (between_SS / total_SS = 83.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "clust_plot"
## [11] "silinfo" "nbclust" "data"
fviz_cluster(kresult, main="kmeans / Euclidean",xlab="Age", ylab="Spending score")
The cluster plot above shows the four clusters in the Age / Spending score plane. The stripes plots below show the distances of the data points to the cluster centroids.
#stripes plot using the Euclidean distance metric
euc<-cclust(d, 4, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
stripes(euc)
# stripes plot using the Manhattan distance metric
man<-cclust(d, 4, dist="manhattan")
stripes(man)
PAM (Partitioning Around Medoids), also known as k-medoids, is a method that is more robust to noise and outliers than K-means. In contrast to K-means, which minimizes the within-cluster variances, k-medoids minimizes the dissimilarities between the data points within a cluster and the center of that cluster. Moreover, k-medoids can be used with an arbitrary distance measure. The medoid is a data point chosen as the cluster center. The algorithm consists of the following steps:
1. Build: select k medoids using a greedy algorithm and assign every object to the closest one.
2. Swap: for every non-medoid point, calculate the cost (using the chosen distance metric) of making it a medoid, and choose the swap with the smallest cost.
A big advantage of this method is that the build phase already arrives at a fairly good clustering, since the first medoids are not random points; therefore the swap phase is performed only a few times before the algorithm converges.
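To make the swap phase concrete, the quantity it minimizes can be sketched in a few lines. The helper below is illustrative only (total_cost, dmat, and med are names introduced here, not part of the analysis): it computes the total dissimilarity of every point to its closest medoid, which is the cost that the swap phase tries to reduce. cluster::pam(), used later on, implements the full build/swap algorithm.
#illustrative sketch of the cost minimized by PAM's swap phase
dmat<-as.matrix(dist(d))      #Euclidean dissimilarities; any metric could be used
total_cost<-function(med, dmat) sum(apply(dmat[, med, drop = FALSE], 1, min))
med<-sample(nrow(d), 4)       #an arbitrary candidate set of 4 medoids
total_cost(med, dmat)
#the swap phase replaces a medoid with a non-medoid point whenever that lowers this
#total cost, and stops when no swap improves it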
fviz_nbclust(d, pam)
cust<-eclust(d, "pam", k=4)
fviz_cluster(cust, geom="point", ellipse.type="norm")
head(summary(cust))
## $medoids
## customers.Age customers.Spending.Score..1.100.
## [1,] 27 50
## [2,] 30 83
## [3,] 43 17
## [4,] 53 46
##
## $id.med
## [1] 98 200 151 60
##
## $clustering
## [1] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 1 1 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
## [38] 2 3 2 4 2 4 1 3 1 4 1 1 1 4 1 1 4 4 4 4 4 1 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4
## [75] 4 1 4 4 1 4 4 1 4 4 1 4 4 1 1 4 4 1 4 4 1 1 4 1 4 1 1 4 4 1 4 1 4 4 4 4 4
## [112] 1 1 1 1 1 4 4 4 4 1 1 1 2 1 2 4 2 3 2 3 2 1 2 3 2 3 2 3 2 3 2 1 2 3 2 4 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 4 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 2 3 2 3 2 4
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##
## $objective
## build swap
## 11.70431 10.80452
##
## $isolation
## 1 2 3 4
## no no no no
## Levels: no L L*
##
## $clusinfo
## size max_diss av_diss diameter separation
## [1,] 44 21.37756 10.18784 37.12142 2.000000
## [2,] 57 16.76305 9.71942 31.78050 5.000000
## [3,] 48 26.83282 12.93507 48.83646 3.605551
## [4,] 51 19.72308 10.54409 34.20526 2.000000
fviz_silhouette(cust)
## cluster size ave.sil.width
## 1 1 44 0.45
## 2 2 57 0.59
## 3 3 48 0.45
## 4 4 51 0.49
Objects with a silhouette value Si close to 1 are well clustered, observations with Si around 0 lie between two clusters, and samples with negative Si are probably assigned to the wrong cluster.
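For clarity, the silhouette width reported above can be computed by hand for a single observation. The sketch below (silhouette_point is a hypothetical helper introduced here) follows the standard definition: a(i) is the average distance from point i to the other members of its own cluster, b(i) the average distance to the nearest other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). It assumes the PAM result cust and the data frame d from earlier.
#illustrative computation of the silhouette width of a single point
dmat<-as.matrix(dist(d))
cl<-cust$clustering
silhouette_point<-function(i) {
  a <- sum(dmat[i, cl == cl[i]]) / (sum(cl == cl[i]) - 1)   #mean distance to own cluster (d(i,i)=0, so only the divisor excludes i)
  b <- min(sapply(setdiff(unique(cl), cl[i]), function(k) mean(dmat[i, cl == k])))
  (b - a) / max(a, b)
}
silhouette_point(1)   #silhouette width of the first customer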
cust2<-eclust(d, "pam", k=4, hc_metric="manhattan")
head(summary(cust2))
## $medoids
## customers.Age customers.Spending.Score..1.100.
## [1,] 27 50
## [2,] 30 83
## [3,] 43 17
## [4,] 53 46
##
## $id.med
## [1] 98 200 151 60
##
## $clustering
## [1] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 1 1 3 2 1 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3
## [38] 2 3 2 4 2 4 1 3 1 4 1 1 1 4 1 1 4 4 4 4 4 1 4 4 1 4 4 4 1 4 4 1 1 4 4 4 4
## [75] 4 1 4 4 1 4 4 1 4 4 1 4 4 1 1 4 4 1 4 4 1 1 4 1 4 1 1 4 4 1 4 1 4 4 4 4 4
## [112] 1 1 1 1 1 4 4 4 4 1 1 1 2 1 2 4 2 3 2 3 2 1 2 3 2 3 2 3 2 3 2 1 2 3 2 4 2
## [149] 3 2 3 2 3 2 3 2 3 2 3 2 4 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 2 3 2 3 2 4
## [186] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##
## $objective
## build swap
## 11.70431 10.80452
##
## $isolation
## 1 2 3 4
## no no no no
## Levels: no L L*
##
## $clusinfo
## size max_diss av_diss diameter separation
## [1,] 44 21.37756 10.18784 37.12142 2.000000
## [2,] 57 16.76305 9.71942 31.78050 5.000000
## [3,] 48 26.83282 12.93507 48.83646 3.605551
## [4,] 51 19.72308 10.54409 34.20526 2.000000
fviz_silhouette(cust2)
## cluster size ave.sil.width
## 1 1 44 0.45
## 2 2 57 0.59
## 3 3 48 0.45
## 4 4 51 0.49
attributes(cust2)
## $names
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
## [11] "clust_plot" "nbclust"
##
## $class
## [1] "pam" "partition" "eclust"
As we can see, the output is identical to the Euclidean run. According to the factoextra documentation, the hc_metric argument of eclust() is used only by hierarchical clustering methods, so PAM here still operates on Euclidean distances rather than the Manhattan metric; a short example of running PAM with the Manhattan dissimilarity directly is sketched below.
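As a hedged follow-up (not part of the original run), the Manhattan metric can be applied by calling cluster::pam() directly; comparing its medoids and objective values with the Euclidean solution above shows whether the choice of metric actually matters for this dataset.
#PAM with the Manhattan dissimilarity, for comparison with the Euclidean run
pam_man<-pam(d, 4, metric = "manhattan")
pam_man$medoids       #medoids found under the Manhattan metric
pam_man$objective     #build/swap objective values
fviz_silhouette(pam_man)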
pam.res <- pam(d, 2)
print(pam.res)
## Medoids:
## ID customers.Age customers.Spending.Score..1.100.
## [1,] 127 43 35
## [2,] 158 30 78
## Clustering vector:
## [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [38] 2 1 2 1 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1
## [75] 1 2 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [112] 2 1 1 1 1 1 1 1 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## [149] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [186] 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
## Objective function:
## build swap
## 18.99659 17.51984
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
pam.res$medoids
## customers.Age customers.Spending.Score..1.100.
## [1,] 43 35
## [2,] 30 78
head(pam.res$clustering)
## [1] 1 2 1 2 1 2
fviz_cluster(pam.res)
CLARA (Clustering Large Applications) is an extension of the k-medoids method designed for clustering large datasets, even those containing thousands of objects. It achieves this by drawing subsets of the data and running PAM on each smaller sample to generate a candidate set of medoids. The algorithm works as follows:
1. Divide the original dataset into several subsets.
2. Apply PAM on each subset.
3. Measure the goodness of the clustering by calculating the dissimilarities of the objects to their closest medoids.
4. Repeat the sampling and clustering process a predefined number of times and select the set of medoids with the minimal cost.
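The sampling loop above can be sketched directly with pam() from the cluster package. The code below is illustrative only (the subsample size of 50 and the 5 repetitions are arbitrary choices made here); the clara() function used later implements the procedure properly.
#illustrative sketch of the CLARA sampling-and-clustering loop
set.seed(123)
dmat<-as.matrix(dist(d))
best_cost<-Inf
for (s in 1:5) {
  idx  <- sample(nrow(d), 50)                #1. draw a subsample of the data
  fit  <- pam(d[idx, ], k = 4)               #2. apply PAM on the subsample
  med  <- idx[fit$id.med]                    #medoid indices mapped back to the full data
  cost <- sum(apply(dmat[, med], 1, min))    #3. goodness over all 200 observations
  if (cost < best_cost) {                    #4. keep the set of medoids with the minimal cost
    best_cost <- cost
    best_med  <- med
  }
}
d[best_med, ]   #the selected medoids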
fviz_nbclust(d, clara)
The output of the CLARA method includes, among others, the following components:
- medoids: the objects that represent the clusters
- clustering: a vector containing the cluster number of each object
- sample: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition
clara.res<-clara(d, 4, metric = "euclidean", stand = FALSE,
samples = 5, pamLike = FALSE)
clara.res
## Call: clara(x = d, k = 4, metric = "euclidean", stand = FALSE, samples = 5, pamLike = FALSE)
## Medoids:
## customers.Age customers.Spending.Score..1.100.
## [1,] 24 52
## [2,] 30 86
## [3,] 40 13
## [4,] 50 45
## Objective function: 11.09193
## Clustering vector: int [1:200] 1 2 3 2 1 2 3 2 3 2 3 2 3 2 3 2 4 1 ...
## Cluster sizes: 41 55 43 61
## Best sample:
## [1] 3 4 8 12 14 19 23 27 33 41 43 44 53 55 56 58 62 63 70
## [20] 79 87 96 107 108 115 120 121 129 130 135 138 139 145 149 153 154 159 165
## [39] 167 171 173 174 176 179 181 183 185 199
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
fviz_silhouette(clara.res)
## cluster size ave.sil.width
## 1 1 8 0.68
## 2 2 9 0.50
## 3 3 17 0.45
## 4 4 14 0.38
clara.res$medoids
## customers.Age customers.Spending.Score..1.100.
## [1,] 24 52
## [2,] 30 86
## [3,] 40 13
## [4,] 50 45
head(clara.res$clustering, 10)
## [1] 1 2 3 2 1 2 3 2 3 2
fviz_cluster(clara.res)
In this paper, we implemented three clustering methods, K-means, K-medoids (PAM), and CLARA, to carry out customer segmentation. All of these algorithms identify groups by assigning data points to clusters based on a measure of similarity or distance between the data points and a representative point. In K-means, this representative point is the mean of the data points in the cluster, while in K-medoids and CLARA it is a medoid, the data point that is most centrally located within the cluster. Additionally, all three algorithms require a priori knowledge of the number of clusters, which can be difficult to determine in practice. All three are iterative: they repeatedly update the assignment of data points to clusters and the location of the representative points (centroids or medoids) until the solution stabilizes.