Customer segmentation is defined as the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. In business-to-business marketing, a company might segment customers according to a wide range of factors (Shopify Business Encyclopedia, 2021).
Segmentation makes it easier for the marketing team to satisfy customers by tailoring products and campaigns to each target group, so that advertisements and other marketing communications are directed to the appropriate people.
Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items, without having been told ahead of time how the groups should look. As we may not even know what we are looking for, clustering is used for knowledge discovery rather than prediction: it provides insight into the natural groupings found within data (Packt, 2016).
There are several clustering methods, of which K-means is the most popular and widely used. This article is a brief tutorial on three of them: K-means, PAM and CLARA. The main aim is to show beginners how each method works using a simple customer dataset.
For the purposes of this paper we used the mall customers dataset available on Kaggle (see the references at the end). The dataset was created for learning purposes and contains five columns: customer ID, gender, age, annual income and spending score.
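The code that follows assumes the CSV from the Kaggle link in the references has been downloaded; on Kaggle the file is named Mall_Customers.csv (worth checking against your own download). A minimal setup, including the packages used throughout this tutorial, might look like this:
# packages used in this tutorial: cluster provides pam() and clara(),
# factoextra provides eclust() and the fviz_* plotting helpers,
# corrplot draws the correlation plots, gridExtra provides grid.arrange()
library(cluster)
library(factoextra)
library(corrplot)
library(gridExtra)
# read the Kaggle CSV; adjust the path to wherever you saved it
mall <- read.csv("Mall_Customers.csv")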
The table below shows the structure of the dataset.
head(mall)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
It is necessary to check for and remove any missing values. The variable gender2 is created to recode the Gender variable as numeric, with 0 representing males and 1 representing females. Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them as necessary. We will perform this data transformation and cleaning below.
any(is.na(mall))
## [1] FALSE
mall$gender2<- ifelse(mall$Gender=="Male",0,1)
# as.factor() below only prints the recoded values and their levels;
# the result is not assigned back, so gender2 remains numeric, which is
# what the distance-based clustering later in the tutorial requires
as.factor(mall$gender2)
## [1] 0 0 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1
## [38] 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1 1
## [75] 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0
## [112] 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 1 0 0 0 1
## [149] 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1
## [186] 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0
## Levels: 0 1
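As a quick sanity check on the recoding, a cross-tabulation of the original and derived variables should show each gender mapping to exactly one code:
# each row of the table should have a single non-zero entry
table(mall$Gender, mall$gender2)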
summary(mall)
## CustomerID Gender Age Annual.Income..k..
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100. gender2
## Min. : 1.00 Min. :0.00
## 1st Qu.:34.75 1st Qu.:0.00
## Median :50.00 Median :1.00
## Mean :50.20 Mean :0.56
## 3rd Qu.:73.00 3rd Qu.:1.00
## Max. :99.00 Max. :1.00
Excluding the CustomerID and character Gender columns, keeping the four numeric variables for clustering
mallclus <- mall[,3:6]
summary(mallclus)
## Age Annual.Income..k.. Spending.Score..1.100. gender2
## Min. :18.00 Min. : 15.00 Min. : 1.00 Min. :0.00
## 1st Qu.:28.75 1st Qu.: 41.50 1st Qu.:34.75 1st Qu.:0.00
## Median :36.00 Median : 61.50 Median :50.00 Median :1.00
## Mean :38.85 Mean : 60.56 Mean :50.20 Mean :0.56
## 3rd Qu.:49.00 3rd Qu.: 78.00 3rd Qu.:73.00 3rd Qu.:1.00
## Max. :70.00 Max. :137.00 Max. :99.00 Max. :1.00
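One caveat worth noting: K-means, PAM and CLARA are all distance-based, so variables with wide ranges such as annual income and spending score will dominate gender2, which only takes the values 0 and 1. This tutorial clusters the raw values, but a common alternative, sketched here, is to standardise the columns first:
# optional: rescale each column to mean 0 and standard deviation 1
# so that no single variable dominates the Euclidean distances
mallclus_scaled <- scale(mallclus)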
We plot the key variables to get a visual overview of the dataset.
#customer gender
bars <- table(mall$Gender)
barplot(bars, main="Gender of customer",
ylab="Number of Customers",
xlab = " Gender",
ylim =c(0,200), xlim = c(0,2),col = c('blue','green'))
#age
hist(mall$Age,main="Customer Age",
col="red",
ylim = c(0,80), xlim = c(15, 80),
ylab="Number of Customers",
xlab = "Age Range")
# Correlation Plots
# compute the Pearson correlation matrix once and reuse it for both plots
cormat <- cor(mallclus, method = "pearson", use = "everything")
corrplot(cormat, type ="lower")
corrplot(cormat, method="number", type = "upper")
The correlation matrix above shows a coefficient of approximately -0.7 for age versus spending score, indicating a fairly strong negative linear relationship: as age increases, spending score tends to decrease. There is also a low positive correlation of approximately 0.2 between gender2 and spending score, suggesting a weak relationship between the two, and a low negative correlation between gender2 and annual income, showing no strong relationship between those variables.
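Individual coefficients can also be read straight from the correlation matrix rather than off the plot, for example:
# correlation between age and spending score, taken from cormat
cormat["Age", "Spending.Score..1.100."]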
#let's decide the maximum K to cluster. Say 10:
k.max <- 10
#we will create a vector of the total within-cluster sum of squares, in order to visualize it
wss <- sapply(1:k.max, function(k){kmeans(mallclus, k,
nstart=50,iter.max = 1000 )$tot.withinss})
wss
## [1] 308862.06 212889.44 143391.59 104414.68 75399.62 58348.64 51130.69
## [8] 44355.31 40615.15 37061.44
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
According to GeeksforGeeks (Feb 2021), a fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The elbow method is one of the most popular ways to determine this optimal value of k. We look for the value of k at the "elbow", i.e. the point after which the distortion/inertia starts decreasing in a linear manner. For this case we will take k to be 5.
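If the elbow is hard to judge by eye, one rough numeric aid, a sketch rather than a formal test, is to look at how much each additional cluster reduces the total within-cluster sum of squares:
# drop in total WSS gained by each additional cluster;
# the elbow is roughly where these drops level off
round(-diff(wss))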
#total within-cluster sum of squares (elbow) plots for all three methods
AA <- fviz_nbclust(mallclus, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal number of clusters \n K-means")
BB <- fviz_nbclust(mallclus, FUNcluster = cluster::pam, method = "wss") +
ggtitle("Optimal number of clusters \n PAM")
CC <- fviz_nbclust(mallclus, FUNcluster = cluster::clara, method = "wss") +
ggtitle("Optimal number of clusters \n CLARA")
grid.arrange(AA, BB, CC, ncol=3)
The total within-cluster sum of squares is plotted for K-means, PAM and CLARA. It is not apparent from these plots whether the optimal number of clusters is 5 or 7, so the silhouette score plot is used to investigate further.
AA <- fviz_nbclust(mallclus, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
BB <- fviz_nbclust(mallclus, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
CC <- fviz_nbclust(mallclus, FUNcluster = cluster::clara, method = "silhouette") +
ggtitle("Optimal number of clusters \n CLARA")
grid.arrange(AA, BB, CC, ncol=3)
In the silhouette score plots all three methods (K-means, PAM and CLARA) show that the optimal number of clusters is 6, so we will use k = 6 for the cluster analysis.
k6 <- eclust(mallclus, k=6 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c2 <- fviz_cluster(k6, data=mallclus, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 6 clusters")
s2 <- fviz_silhouette(k6)
## cluster size ave.sil.width
## 1 1 11 0.25
## 2 2 17 0.32
## 3 3 28 0.51
## 4 4 10 0.28
## 5 5 39 0.57
## 6 6 95 0.22
grid.arrange(c2, s2, ncol=2)
p6 <- eclust(mallclus, k=6 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp2 <- fviz_cluster(p6, data=mallclus, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 6 clusters")
sp2 <- fviz_silhouette(p6)
## cluster size ave.sil.width
## 1 1 25 0.46
## 2 2 20 0.45
## 3 3 44 0.45
## 4 4 37 0.41
## 5 5 39 0.50
## 6 6 35 0.41
grid.arrange(cp2, sp2, ncol=2)
cla6 <- eclust(mallclus, k=6 , FUNcluster="clara", hc_metric="euclidean", graph=F)
cc2 <- fviz_cluster(cla6, data=mallclus, ellipse.type="norm", geom=c("point")) + ggtitle("CLARA with 6 clusters")
sc2 <- fviz_silhouette(cla6)
## cluster size ave.sil.width
## 1 1 22 0.40
## 2 2 21 0.60
## 3 3 46 0.44
## 4 4 36 0.40
## 5 5 39 0.50
## 6 6 36 0.40
grid.arrange(cc2, sc2, ncol=2)
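The overall average silhouette widths compared below do not have to be computed by hand; assuming the silinfo component that factoextra's eclust() attaches to its results, they can be read directly off each object:
# average silhouette width for each method at k = 6
k6$silinfo$avg.width    # K-means
p6$silinfo$avg.width    # PAM
cla6$silinfo$avg.width  # CLARA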
Both PAM and CLARA have an average silhouette width of 0.45, which is higher than that of K-means at 0.34.
In the K-means plot, parts of the 1st and 6th clusters fall below the zero line. In PAM there is a large drop below zero of approximately -0.25 in the 1st cluster, with smaller drops in the 2nd and 6th clusters. In CLARA, on the other hand, there are drops in clusters 1, 2, 4 and 6, but none of them appears very significant except the one in cluster 6.
Since PAM and CLARA tie on average silhouette width, the negative silhouette values break the tie: CLARA's misassigned points are fewer and less pronounced, so we conclude that CLARA is the better-fitting method of the two.
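To turn the chosen clustering into actionable segments, a natural next step is to profile each cluster, for example by its average age, income and spending score. A minimal sketch using the CLARA result:
# mean of each variable within each CLARA cluster; high-income,
# high-spending segments are the natural marketing targets
aggregate(mallclus, by = list(cluster = cla6$clustering), FUN = mean)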
References
Kaggle: Customer Segmentation Tutorial in Python (dataset). https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
Shopify Business Encyclopedia (2021). Customer Segmentation. https://www.shopify.com/encyclopedia/customer-segmentation
GeeksforGeeks (Feb 2021). Elbow Method for Optimal Value of k in KMeans. https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/