Introduction

Customer segmentation is the process of dividing customers into groups based on common characteristics so that companies can market to each group effectively and appropriately. In business-to-business marketing, a company might segment customers according to a wide range of factors (Shopify Business Encyclopedia, 2021).

Segmentation makes it easier for the marketing team to satisfy customers by tailoring marketing material to the target group, so that advertisements and other marketing communications reach the appropriate group of people.

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told ahead of time how the groups should look. As we may not even know what we’re looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data (Packt, 2016).

There are several ways of clustering, the most popular and widely used being K-means. This article is a brief tutorial on the clustering methods described here. The main aim is to show beginners how each method works, using a simple example of customer data. The methods explored are K-means, PAM and CLARA.

Dataset

For this article we used a dataset available on Kaggle (reference 1 in the References section). The dataset was created for learning purposes and contains 5 columns: customer ID, gender, age, annual income and spending score.

The first rows of the dataset are shown below.

head(mall)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Data transformation and cleaning

Data cleaning is the process of identifying incorrect, incomplete, inaccurate, irrelevant or missing parts of the data and then modifying, replacing or deleting them as necessary. Here we first check for missing values; there are none, so nothing needs to be removed. We then create the variable gender2, which recodes Gender as a numeric variable with 0 representing males and 1 representing females.

any(is.na(mall))
## [1] FALSE
mall$gender2<- ifelse(mall$Gender=="Male",0,1)
as.factor(mall$gender2)
##   [1] 0 0 1 1 1 1 1 1 0 1 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 1
##  [38] 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1 1
##  [75] 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 0
## [112] 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 1 0 0 0 1
## [149] 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1
## [186] 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0
## Levels: 0 1
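
Note that `as.factor()` returns a new vector rather than modifying the data frame, so `gender2` remains numeric unless the result is assigned. The following is a minimal sketch on a hypothetical three-row data frame (`toy` and `gender_f` are illustrative names, not part of the original data):

```r
# Hypothetical toy data frame standing in for `mall`
toy <- data.frame(Gender = c("Male", "Female", "Female"))

# Numeric encoding: 0 = Male, 1 = Female
toy$gender2 <- ifelse(toy$Gender == "Male", 0, 1)

# Calling as.factor() without assignment only prints a factor
as.factor(toy$gender2)
is.numeric(toy$gender2)   # the stored column is still numeric

# Assigning the result stores a separate factor version of the column
toy$gender_f <- as.factor(toy$gender2)
is.factor(toy$gender_f)
```

Keeping the numeric version is what allows `gender2` to be used in the distance-based clustering later on.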

Summary for the Dataset

summary(mall)
##    CustomerID        Gender               Age        Annual.Income..k..
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50    
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.    gender2    
##  Min.   : 1.00          Min.   :0.00  
##  1st Qu.:34.75          1st Qu.:0.00  
##  Median :50.00          Median :1.00  
##  Mean   :50.20          Mean   :0.56  
##  3rd Qu.:73.00          3rd Qu.:1.00  
##  Max.   :99.00          Max.   :1.00

Excluding ID variable

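# Keep Age, annual income, spending score and gender2
# (this drops CustomerID and the character Gender column)
mallclus <- mall[,3:6]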
summary(mallclus)
##       Age        Annual.Income..k.. Spending.Score..1.100.    gender2    
##  Min.   :18.00   Min.   : 15.00     Min.   : 1.00          Min.   :0.00  
##  1st Qu.:28.75   1st Qu.: 41.50     1st Qu.:34.75          1st Qu.:0.00  
##  Median :36.00   Median : 61.50     Median :50.00          Median :1.00  
##  Mean   :38.85   Mean   : 60.56     Mean   :50.20          Mean   :0.56  
##  3rd Qu.:49.00   3rd Qu.: 78.00     3rd Qu.:73.00          3rd Qu.:1.00  
##  Max.   :70.00   Max.   :137.00     Max.   :99.00          Max.   :1.00

Visualization of the data set

We plot the key variables to get a first visual impression of the dataset.

Bar Graphs

#customer gender
bars <- table(mall$Gender)
barplot(bars, main="Gender of customer",
        ylab="Number of Customers",
        xlab = " Gender",
        ylim =c(0,200), xlim = c(0,2),col = c('blue','green')) 

Histograms

#age
hist(mall$Age,main="Customer Age",
     col="red",
     ylim = c(0,80), xlim = c(10, 80),  # the youngest customer is 18, so start the axis below that
     ylab="Number of Customers",
     xlab = "Age Range")


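Correlation Plots

```r
library(corrplot)  # provides corrplot()

# Pearson correlation matrix of the clustering variables
cormat <- cor(mallclus, method = "pearson")

# Lower triangle as circles, upper triangle as numeric coefficients
corrplot(cormat, type = "lower")
corrplot(cormat, method = "number", type = "upper")
```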

The correlation matrix above shows a correlation of approximately -0.7 between age and spending score, which indicates that spending score tends to decrease as age increases. There is a low positive correlation of approximately 0.2 between gender2 and spending score, which suggests only a weak relationship between the two. There is also a low negative correlation between gender2 and annual income, which shows that there is no strong relationship between these two variables.
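
To make the meaning of a correlation coefficient concrete, here is a minimal sketch on hypothetical toy vectors (the values of `age` and `score` are invented for illustration and are not taken from the mall data). It computes the Pearson coefficient from its definition and compares it with `cor()`:

```r
# Hypothetical toy vectors (not taken from the mall data)
age   <- c(20, 30, 40, 50, 60)
score <- c(80, 65, 70, 40, 35)

# Pearson correlation from its definition:
# covariance divided by the product of the standard deviations
r_manual <- cov(age, score) / (sd(age) * sd(score))

# The built-in function gives the same value
r_builtin <- cor(age, score, method = "pearson")
all.equal(r_manual, r_builtin)

# r is unit-free and lies in [-1, 1]; r^2 is the share of variance
# in one variable that is linearly explained by the other
r_builtin^2
```

A coefficient is therefore a measure of the strength and direction of a linear association, not a rate at which one variable changes with the other.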

Elbow Method

# Maximum number of clusters to try
k.max <- 10

# Total within-cluster sum of squares for k = 1..k.max, kept for plotting
wss <- sapply(1:k.max, function(k){kmeans(mallclus, k,
                                          nstart=50,iter.max = 1000 )$tot.withinss})

wss
##  [1] 308862.06 212889.44 143391.59 104414.68  75399.62  58348.64  51130.69
##  [8]  44355.31  40615.15  37061.44
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE,
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

According to GeeksforGeeks (February 2021), a fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular ways to determine this optimal value of k: we look for the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear manner. In this case we take k to be 5.
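
The quantity plotted in the elbow curve can be checked by hand. The sketch below uses hypothetical toy data (two simulated blobs, not the mall data) to show that `tot.withinss` is the sum of squared Euclidean distances from each point to its own cluster centre:

```r
set.seed(42)  # k-means uses random starts, so fix the seed for reproducibility

# Hypothetical toy data: two well-separated 2-D blobs of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Squared Euclidean distance of every point to its own cluster centre
wss_manual <- sum((toy - fit$centers[fit$cluster, ])^2)
all.equal(wss_manual, fit$tot.withinss)

# WSS for k = 1..6; for this toy data the sharp drop is from k = 1 to k = 2
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)
```

Because adding clusters keeps reducing this sum, the elbow is read as the point where further reductions become small rather than as a minimum.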

Total within-clusters sum of Squares

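library(factoextra)  # fviz_nbclust(), eclust(), fviz_cluster(), fviz_silhouette()
library(gridExtra)   # grid.arrange()

# Elbow (total within-cluster sum of squares) plots for K-means, PAM and CLARA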
AA <- fviz_nbclust(mallclus, FUNcluster = kmeans, method = "wss") +
        ggtitle("Optimal number of clusters \n K-means")
BB <- fviz_nbclust(mallclus, FUNcluster = cluster::pam, method = "wss") +
        ggtitle("Optimal number of clusters \n PAM")
CC <- fviz_nbclust(mallclus, FUNcluster = cluster::clara, method = "wss") +
        ggtitle("Optimal number of clusters \n CLARA")

grid.arrange(AA, BB, CC, ncol=3)

The total within-clusters sum of squares is plotted for K-means, PAM and CLARA. The elbow is not clear-cut: the optimal number of clusters appears to lie somewhere between 5 and 7. The silhouette score plot is therefore used to investigate further.

Silhouette Score Plot

AA <- fviz_nbclust(mallclus, FUNcluster = kmeans, method = "silhouette") +
        ggtitle("Optimal number of clusters \n K-means")
BB <- fviz_nbclust(mallclus, FUNcluster = cluster::pam, method = "silhouette") +
        ggtitle("Optimal number of clusters \n PAM")
CC <- fviz_nbclust(mallclus, FUNcluster = cluster::clara, method = "silhouette") +
        ggtitle("Optimal number of clusters \n CLARA")

grid.arrange(AA, BB, CC, ncol=3)

In the silhouette score plot, all three methods (K-Means, PAM and CLARA) show that the optimal number of clusters is 6; we therefore use k=6 for the cluster analysis.
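
For each observation, the silhouette width compares its average distance a to the members of its own cluster with its average distance b to the nearest other cluster, as (b - a) / max(a, b), so values close to 1 indicate a well-placed point. As a rough sketch of what the silhouette-based choice of k does, the code below uses `cluster::silhouette()` on hypothetical toy data (three simulated blobs, not the mall data); `avg_sil` is an illustrative helper name:

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Hypothetical toy data: three 2-D blobs of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 6),  ncol = 2),
             matrix(rnorm(40, mean = 12), ncol = 2))
d <- dist(toy)  # Euclidean distances between all pairs of points

# Average silhouette width for a given k (only defined for k >= 2)
avg_sil <- function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
}

ks  <- 2:6
sil <- sapply(ks, avg_sil)
names(sil) <- ks
round(sil, 2)

# The k with the largest average width is the silhouette-based choice
best_k <- ks[which.max(sil)]
best_k
```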

K-Means

k6 <- eclust(mallclus, k=6 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c2 <- fviz_cluster(k6, data=mallclus, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 6 clusters")
s2 <- fviz_silhouette(k6)
##   cluster size ave.sil.width
## 1       1   11          0.25
## 2       2   17          0.32
## 3       3   28          0.51
## 4       4   10          0.28
## 5       5   39          0.57
## 6       6   95          0.22
grid.arrange(c2, s2, ncol=2)

PAM

p6 <- eclust(mallclus, k=6 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp2 <- fviz_cluster(p6, data=mallclus, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 6 clusters")
sp2 <- fviz_silhouette(p6)
##   cluster size ave.sil.width
## 1       1   25          0.46
## 2       2   20          0.45
## 3       3   44          0.45
## 4       4   37          0.41
## 5       5   39          0.50
## 6       6   35          0.41
grid.arrange(cp2, sp2, ncol=2)
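
PAM (Partitioning Around Medoids) differs from K-means in that each cluster is represented by a medoid, which is an actual observation, rather than by a mean. The following sketch illustrates the difference on hypothetical toy data with one extreme point (the values are invented for illustration):

```r
library(cluster)  # provides pam()

# Hypothetical toy data with one extreme outlier in the last row
toy <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2),
             c(8, 8), c(8, 9), c(9, 8), c(50, 50))

set.seed(1)
km <- kmeans(toy, centers = 2, nstart = 10)
pm <- pam(toy, k = 2)

km$centers   # cluster means: need not coincide with any observed row
pm$medoids   # cluster medoids: always rows of `toy`
pm$id.med    # row indices of the medoids within `toy`
```

Because a medoid must be an observed point, it cannot be dragged to an empty region of the space by an outlier in the way a mean can.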

CLARA

cla6 <- eclust(mallclus, k=6 , FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
cc2 <- fviz_cluster(cla6, data=mallclus, ellipse.type="norm", geom=c("point")) + ggtitle("CLARA with 6 clusters")
sc2 <- fviz_silhouette(cla6)
##   cluster size ave.sil.width
## 1       1   22          0.40
## 2       2   21          0.60
## 3       3   46          0.44
## 4       4   36          0.40
## 5       5   39          0.50
## 6       6   36          0.40
grid.arrange(cc2, sc2, ncol=2)
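
CLARA (Clustering Large Applications) applies the PAM idea to subsamples of the data and then assigns every observation to the nearest of the resulting medoids, which keeps it tractable on larger datasets. A minimal sketch with `cluster::clara()` on hypothetical simulated data (the `samples` and `sampsize` values are arbitrary choices for illustration):

```r
library(cluster)  # provides clara()

set.seed(42)
# Hypothetical toy data: 1000 points in two 2-D blobs of 500 points each
toy <- rbind(matrix(rnorm(1000, mean = 0), ncol = 2),
             matrix(rnorm(1000, mean = 6), ncol = 2))

# Draw 5 subsamples of 50 points, run PAM on each, keep the best medoids
fit <- clara(toy, k = 2, samples = 5, sampsize = 50)

fit$medoids            # medoids found on the best subsample
table(fit$clustering)  # every one of the 1000 points is still assigned
```

On a dataset as small as the 200 customers used here, PAM and CLARA can be expected to give similar partitions, since the subsample covers a large share of the data.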

Conclusion

Both PAM and CLARA have an average silhouette width of 0.45, which is higher than that of K-Means (0.34).
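
These overall figures can be recovered from the per-cluster tables printed in the K-Means, PAM and CLARA sections, as the size-weighted mean of the per-cluster widths. The sketch below copies the sizes and widths from those tables; the data frame and function names are illustrative, and small discrepancies are possible because the table entries are already rounded:

```r
# Cluster sizes and average silhouette widths copied from the printed tables
kmeans_tab <- data.frame(size = c(11, 17, 28, 10, 39, 95),
                         sil  = c(0.25, 0.32, 0.51, 0.28, 0.57, 0.22))
pam_tab    <- data.frame(size = c(25, 20, 44, 37, 39, 35),
                         sil  = c(0.46, 0.45, 0.45, 0.41, 0.50, 0.41))
clara_tab  <- data.frame(size = c(22, 21, 46, 36, 39, 36),
                         sil  = c(0.40, 0.60, 0.44, 0.40, 0.50, 0.40))

# Size-weighted mean of the per-cluster widths
overall <- function(tab) weighted.mean(tab$sil, tab$size)

round(c(kmeans = overall(kmeans_tab),
        pam    = overall(pam_tab),
        clara  = overall(clara_tab)), 2)
# kmeans = 0.34, pam = 0.45, clara = 0.45
```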

In the K-Means silhouette plot, parts of the 1st and 6th clusters fall below the zero line. In PAM there is a large drop below zero, to approximately -0.25, in the 1st cluster, with further drops in the 2nd and 6th clusters. In CLARA there are drops in clusters 1, 2, 4 and 6, but none appears substantial except the one in cluster 6.

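Since PAM and CLARA share the same average silhouette width, the comparison rests on these negative values: CLARA's drops below zero are less pronounced, so we conclude that CLARA fits this dataset better than PAM.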

References

  1. https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python

  2. https://www.shopify.com/encyclopedia/customer-segmentation

  3. https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/