Introduction

Customer segmentation is one of the business areas where clustering methods can be very useful. Thanks to this unsupervised learning technique, a company can adjust its marketing strategy in a more efficient manner and focus on those customers who will generate the highest revenue. In this paper I will describe how to implement two different clustering algorithms (k-means and hierarchical clustering) to segment customers into groups according to their spending and income.

Exploratory Data Analysis

First, let’s look at the dataset to see what kind of groups we can expect to form when we apply the clustering methods at a later stage.

customers <- read.csv('/users/benna/downloads/Mall_Customers.csv') 
summary(customers)

Let’s rename two variables so they are easier to refer to in the rest of the analysis. In addition, let’s exclude the variable ‘CustomerID’: since we will analyse groups of customers in this paper, the ID of a particular customer won’t be useful for us.

names(customers)[4] <- 'AnnualIncome'
names(customers)[5] <- 'SpendingScore'
customers <- customers[, 2:5]
summary(customers)

The table above indicates that in total we have 200 customers in our dataset, of whom 112 are women and 88 are men. The age of the customers varies from 18 to 70 and the annual income from 15 to 137 (given in thousands). The variable SpendingScore is a score assigned by the mall based on the customer’s behaviour and spending nature, where 99 is the maximum and 1 is the minimum value. In order to better understand the data, let’s draw histograms of these variables.

hist(customers$Age,
     col = 'red',  
     main = 'Age of Customers',    
     xlab = 'Age')

Looking at the age of customers, one can see that the biggest group is people from 30 to 35 years old (more than 35 members). The distribution of this variable is skewed to the right, so there are far more people on the left-hand side (40 years old or younger) than on the right-hand side.

hist(customers$AnnualIncome,
     col = 'blue',
     main = 'Annual Income of Customers',
     xlab = 'Annual Income')

When it comes to the annual income of customers, the distribution looks somewhat similar to the previous one, although the difference between the left- and right-hand sides of the histogram is even greater. There are a lot of customers who earn up to 80 thousand annually; incomes greater than this amount are far less common.

hist(customers$SpendingScore,    
     col = 'orange',    
     main = 'Spending Score of Customers',  
     xlab = 'Spending Score')

Finally, the spending score of customers resembles the normal distribution most closely of the three. Most of the observations are concentrated around the mean (50) and both sides of the graph look very similar. Interestingly, there are many observations at both ends, meaning that quite a lot of customers spend either very little or very much. The next step of my analysis is to plot Annual Income and Spending Score with respect to Age and Gender. This should allow us to draw the first more meaningful conclusions from the mall customers dataset.

library(ggplot2)
ggplot(customers) +
  geom_point(aes(x = Age, y = AnnualIncome, col = Gender))
ggplot(customers) +
  geom_point(aes(x = Age, y = SpendingScore, col = Gender))
ggplot(customers) +
  geom_point(aes(x = AnnualIncome, y = SpendingScore, col = Gender))

From the first plot, we can conclude that the highest incomes are observed for people aged 30 to 50. From the second plot we find out that all the ‘big spenders’ are no more than 40 years old. Customers above that age tend to be more frugal, as their highest Spending Score values are around 60 points. But the most interesting patterns appear on the last plot. We can see that the observations tend to group themselves in a few areas of the graph: there is a large group right in the middle and a few groups in the corners of the plot. Gender seems to have little effect when the income and spending of customers are analysed.

K-means clustering

I’ll use the k-means algorithm, a method which identifies k centroids (centers of clusters) and allocates every single observation (data point) to the cluster with the nearest centroid. In the previous part of the paper we saw that the observations form visible groups when plotted by the two variables AnnualIncome and SpendingScore. Therefore the clusters will be generated only on the basis of these two variables.
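
As a brief illustration of the assignment step (a hypothetical helper written for this paper, not part of any package): given a matrix of points and k centroids, each point goes to the centroid with the smallest Euclidean distance.

# Sketch of the k-means assignment step: for each point (row), return the
# index of the nearest centroid, using squared Euclidean distance.
assign_to_nearest <- function(points, centroids) {
  apply(points, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
}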

First, it is crucial to determine the optimal number of clusters. In order to do so I’ll use the so-called elbow method from the ClusterR package, which plots the explained variation as a function of the number of clusters. The ‘elbow’ of the plot should correspond to the optimal number of clusters.

library(ClusterR)
opt <- Optimal_Clusters_KMeans(customers[, 3:4], max_clusters = 10, plot_clusters = T)
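
For comparison, a similar elbow curve can be produced with base R alone. This is a sketch based on the total within-cluster sum of squares returned by kmeans; note that ClusterR’s default criterion is the variance explained, so the two curves are related but not identical.

# Total within-cluster sum of squares for k = 1..10
# (nstart = 25 reduces the dependence on random initialisation)
wss <- sapply(1:10, function(k) kmeans(customers[, 3:4], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = 'b', xlab = 'Number of clusters', ylab = 'Total within-cluster SS')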

From the plots above, it is quite hard to clearly define where the ‘elbow’ occurs. Let’s use another method, called the ‘silhouette’, to identify the number of clusters. It measures how similar an object is to its own cluster compared to other clusters. I’ll plot the average silhouette value for k ranging from 2 to 10; the k with the highest value should be the optimal number of clusters.

opt <- Optimal_Clusters_KMeans(customers[, 3:4], max_clusters = 10, plot_clusters = T, criterion = 'silhouette')

The highest average silhouette value (equal to 0.54) occurs for k = 5. Therefore we should opt for 5 clusters in our further analysis with the k-means algorithm. In the next step we will add the cluster number to each observation in our dataframe and plot the observations coloured by cluster with ggplot.

set.seed(22) 
km <- kmeans(customers[,3:4], 5)
customers$ClusterNumber <- km$cluster  

ggplot(customers[, 3:5]) +
  geom_point(aes(x = AnnualIncome, y = SpendingScore, col = as.factor(ClusterNumber))) +
  scale_color_discrete(name = 'Cluster Number')

The plot above clearly indicates that the k-means algorithm distinguished 5 groups of customers. There are two different groups on the left-hand side, with small incomes and spending scores that are high in one group and low in the other. Similarly, we can see two groups formed on the right-hand side of the graph, this time including people whose earnings are above average. Finally, there is one group in the middle, with customers whose annual earnings are medium and whose spending habits sit somewhere in the middle of the whole population. In the next part of the paper, I will perform hierarchical clustering and compare the results of both methods.
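
To attach numbers to this description, we can optionally inspect the centers and sizes stored in the fitted km object; both are standard components of the kmeans output.

km$centers   # mean AnnualIncome and SpendingScore of each cluster
km$size      # number of customers in each cluster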

Hierarchical clustering

In this part of the customer segmentation analysis, I will use agglomerative hierarchical clustering (also known as the bottom-up approach), a method used to group objects based on their similarity. At the beginning each observation starts in its own cluster, and step by step pairs of clusters are merged as we move up the hierarchy. Before I implement the clustering algorithm, I will compute the distances between the data points, and in the next step the ‘hclust’ function will be used to perform the cluster analysis.

dist_customers <- dist(customers[,3:4]) 
hc_customers <- hclust(dist_customers, method = 'complete')
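
To see the bottom-up merging in action, we can optionally inspect the merge history stored in the hclust object: negative entries denote single observations, while positive entries refer to clusters formed in earlier steps.

head(hc_customers$merge)    # first merges: which observations/clusters were joined
head(hc_customers$height)   # distance at which each of those merges happened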

Now it’s time to choose the optimal number of clusters. To do that, similarly to the k-means method, I will use the silhouette criterion.

library(cluster)
# Cut the tree into k = 2..7 clusters, storing the assignments as clust2 ... clust7
for (i in 2:7) {
  nam <- paste0('clust', i)
  assign(nam, cutree(hc_customers, k = i))
}

par(mfrow = c(3, 2))  

plot(silhouette(clust2, dist_customers), col = 'blue')
plot(silhouette(clust3, dist_customers), col = 'blue')
plot(silhouette(clust4, dist_customers), col = 'blue')
plot(silhouette(clust5, dist_customers), col = 'blue')
plot(silhouette(clust6, dist_customers), col = 'blue')
plot(silhouette(clust7, dist_customers), col = 'blue')

It turns out that here, too, we should opt for 5 clusters, as the average silhouette width is the biggest for k = 5. Now we are ready to plot the dendrogram. To make it more meaningful, I will use colors to indicate which observations fall into which cluster.
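
One possible way to colour the branches is sketched below; it assumes the dendextend package, which is not loaded in the code above.

library(dendextend)
dend <- as.dendrogram(hc_customers)
dend <- color_branches(dend, k = 5)   # one colour per cluster, matching k = 5
plot(dend, main = 'Dendrogram of Mall Customers')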