Introduction

Customer segmentation is one of the business areas where clustering methods are especially useful. With this unsupervised learning technique, a company can target its marketing more efficiently and focus on the customers who are likely to generate the most revenue.

In this paper I describe how to apply two clustering algorithms, k-means and hierarchical clustering, to segment customers into groups according to their spending and income.

Exploratory Data Analysis

First, let’s look at the dataset to see what kind of groups we can expect when we apply the clustering methods later on.

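# stringsAsFactors = TRUE keeps Gender as a factor (no longer the default since R 4.0)
customers <- read.csv('Mall_Customers.csv', stringsAsFactors = TRUE)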
summary(customers)
##    CustomerID        Gender         Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50                Median :36.00   Median : 61.50    
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

Let’s rename two of the variables so that they are easier to work with in the rest of the analysis. Let’s also drop the variable ‘CustomerID’: this paper analyses groups of customers, so the ID of an individual customer is of no use to us.

# Shorten the two awkward column names and drop CustomerID (column 1)
names(customers)[4:5] <- c('AnnualIncome', 'SpendingScore')
customers <- customers[, 2:5]
summary(customers)
##     Gender         Age         AnnualIncome    SpendingScore  
##  Female:112   Min.   :18.00   Min.   : 15.00   Min.   : 1.00  
##  Male  : 88   1st Qu.:28.75   1st Qu.: 41.50   1st Qu.:34.75  
##               Median :36.00   Median : 61.50   Median :50.00  
##               Mean   :38.85   Mean   : 60.56   Mean   :50.20  
##               3rd Qu.:49.00   3rd Qu.: 78.00   3rd Qu.:73.00  
##               Max.   :70.00   Max.   :137.00   Max.   :99.00

The table above shows that the dataset contains 200 customers, of whom 112 are women and 88 are men. Customer age ranges from 18 to 70 and annual income from 15 to 137 (in thousands). The variable SpendingScore is a score assigned by the mall on the basis of customer behaviour and spending patterns; in this dataset it ranges from 1 to 99. To understand the data better, let’s draw histograms of these variables.

hist(customers$Age,
     col = 'red',
     main = 'Age of Customers',
     xlab = 'Age')

Looking at the age of customers, the largest group is people from 30 to 35 years old (more than 35 members). The distribution is skewed to the right, so there are far more people on the left-hand side (40 years old or younger) than on the right-hand side.

hist(customers$AnnualIncome,
     col = 'blue',
     main = 'Annual Income of Customers',
     xlab = 'Annual Income')

The distribution of annual income looks somewhat similar to the previous one, although the difference between the left- and right-hand sides of the histogram is even greater. Many customers earn up to 80 thousand annually; incomes above this amount are far less common.

hist(customers$SpendingScore,
     col = 'orange',
     main = 'Spending Score of Customers',
     xlab = 'Spending Score')

Finally, of the three variables, the spending score comes closest to a normal distribution. Most of the observations are concentrated around the mean (50) and both sides of the graph look very similar. Interestingly, there are also many observations at both ends, meaning that quite a lot of customers spend either very little or very much.

The next step of the analysis is to plot Age, Annual Income and Spending Score against one another, with the points coloured by Gender. This should allow us to draw the first more meaningful conclusions from the mall customers dataset.

library(ggplot2)
ggplot(customers) +
  geom_point(aes(x = Age, y = AnnualIncome, col = Gender)) 

ggplot(customers) +
  geom_point(aes(x = Age, y = SpendingScore, col = Gender))

ggplot(customers) +
  geom_point(aes(x = AnnualIncome, y = SpendingScore, col = Gender))

From the first plot, we can conclude that the highest incomes are observed for people aged 30 to 50. The second plot shows that all the ‘big spenders’ are no older than 40. Customers above that age tend to be more frugal: their highest Spending Score values are around 60 points. The most interesting patterns, however, appear in the last plot. The observations group themselves in a few areas of the graph: there is a large group right in the middle and a few groups in the corners of the plot. Gender seems to have little effect on the relationship between income and spending.

K-means clustering

I’ll use the k-means algorithm, a method that identifies k centroids (cluster centres) and allocates every observation (data point) to the cluster with the nearest centroid. The previous part of this paper showed that the clearest grouping of customers appears when AnnualIncome is plotted against SpendingScore. Therefore the clusters will be generated on the basis of these two variables only.
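
To make the two building blocks of the algorithm concrete, the sketch below runs one assignment step and one centroid update by hand in base R. The six points and the two starting centroids are made-up values chosen for illustration; they are not taken from the mall dataset.

```r
# Toy data: six points forming two obvious groups (hypothetical values)
pts <- matrix(c(1,   1,
                1.5, 2,
                2,   1,
                8,   8,
                8.5, 9,
                9,   8),
              ncol = 2, byrow = TRUE)

# Two starting centroids (also hypothetical)
centroids <- matrix(c(1.5, 1.5,
                      8.5, 8.5),
                    ncol = 2, byrow = TRUE)

# Assignment step: squared Euclidean distance from every point to every centroid,
# giving one column of distances per centroid
sq_dist <- sapply(seq_len(nrow(centroids)), function(j) {
  rowSums(sweep(pts, 2, centroids[j, ])^2)
})
assignment <- apply(sq_dist, 1, which.min)   # index of the nearest centroid

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids <- t(sapply(seq_len(nrow(centroids)), function(j) {
  colMeans(pts[assignment == j, , drop = FALSE])
}))

print(assignment)
print(new_centroids)
```

The classic (Lloyd-style) formulation of k-means alternates these two steps until the assignments stop changing. R’s kmeans() uses the Hartigan-Wong variant by default, which refines the same idea.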

First, it is crucial to determine the optimal number of clusters. To do so, I’ll use the so-called elbow method, available through the ClusterR package, which plots the explained variation as a function of the number of clusters. The ‘elbow’ of the plot should correspond to the optimal number of clusters.

library(ClusterR)
opt <- Optimal_Clusters_KMeans(customers[, 3:4], max_clusters = 10, plot_clusters = TRUE)
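
As a rough illustration of what an elbow curve summarises, a closely related quantity, the total within-cluster sum of squares, can be computed with base R’s kmeans() for a range of k. The data here is synthetic (three well-separated blobs standing in for the income and spending columns), and the names toy and wss are hypothetical.

```r
set.seed(1)

# Synthetic stand-in for two numeric columns: three Gaussian blobs of 30 points each
toy <- rbind(
  cbind(rnorm(30, 0, 0.3), rnorm(30, 0, 0.3)),
  cbind(rnorm(30, 5, 0.3), rnorm(30, 5, 0.3)),
  cbind(rnorm(30, 0, 0.3), rnorm(30, 5, 0.3))
)

# Total within-cluster sum of squares for k = 1, ..., 6;
# nstart = 20 reruns k-means from 20 random starts and keeps the best fit
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

print(round(wss, 1))
```

On data with a clear structure, the values drop sharply up to the true number of groups and flatten afterwards; the point where the curve bends is the ‘elbow’.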

From the plot above, it is quite hard to say clearly where the ‘elbow’ occurs. Let’s use another method, called ‘silhouette’, to identify the number of clusters. It measures how similar an object is to its own cluster compared with other clusters. I’ll plot the average silhouette value for k ranging from 2 to 10; the highest value should indicate the optimal number of clusters.

opt <- Optimal_Clusters_KMeans(customers[, 3:4], max_clusters = 10, plot_clusters = TRUE, criterion = 'silhouette')
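
The same criterion can be sketched without ClusterR, using silhouette() from the cluster package together with base R’s kmeans(). This is only an illustration on synthetic data (three blobs), with hypothetical names toy, avg_sil and best_k; it is not the computation performed internally by Optimal_Clusters_KMeans.

```r
library(cluster)

set.seed(1)

# Synthetic data: three Gaussian blobs of 30 points each
toy <- rbind(
  cbind(rnorm(30, 0, 0.3), rnorm(30, 0, 0.3)),
  cbind(rnorm(30, 5, 0.3), rnorm(30, 5, 0.3)),
  cbind(rnorm(30, 0, 0.3), rnorm(30, 5, 0.3))
)
d <- dist(toy)   # pairwise Euclidean distances, reused for every k

# Average silhouette width for k = 2, ..., 6
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  labels <- kmeans(toy, centers = k, nstart = 20)$cluster
  sil <- silhouette(labels, d)
  mean(sil[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]   # k with the highest average silhouette
print(round(avg_sil, 2))
print(best_k)
```

Each silhouette width lies between -1 and 1; values close to 1 mean that an observation is much closer to its own cluster than to the nearest other cluster.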

The highest average silhouette value (equal to 0.54) occurs for k = 5. Therefore we should opt for 5 clusters in the further analysis with the k-means algorithm. In the next step we will add the cluster number to each observation in our data frame and plot the observations coloured by cluster with ggplot.

set.seed(22)  # k-means starts from random centroids, so fix the seed
km <- kmeans(customers[, 3:4], centers = 5)
customers$ClusterNumber <- km$cluster

ggplot(customers[,3:5])  +
  geom_point(aes(x = AnnualIncome, y = SpendingScore, col = as.factor(ClusterNumber))) +
  scale_color_discrete(name="Cluster Number")

The plot above clearly shows that the k-means algorithm distinguished 5 groups of customers. There are 2 groups on the left-hand side, both with low incomes, with high spending scores in one group and low ones in the other. Similarly, 2 groups were formed on the right-hand side of the graph, this time consisting of people whose earnings are above average. Finally, there is one group in the middle, made up of customers with medium annual earnings whose spending habits also sit in the middle of the whole population. In the next part of this paper, I will perform hierarchical clustering and compare the results of both methods.

Hierarchical clustering

In this part of the customer segmentation analysis, I will use agglomerative hierarchical clustering (also known as the bottom-up approach), a method used to group objects based on their similarity. Each observation starts in its own cluster, and pairs of clusters are then merged step by step as we move up the hierarchy. Before running the clustering algorithm, I will compute the distances between the data points; the ‘hclust’ function will then be used to perform the cluster analysis.
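
The bottom-up merging can be traced on a tiny example. The five points below are invented for illustration; the sketch uses the same base R functions (dist, hclust, cutree) with complete linkage and inspects the order and height of the merges.

```r
# Five hypothetical points: two tight pairs and one isolated point
pts <- matrix(c(0,  0,
                0,  1,
                5,  5,
                5,  6.5,
                10, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("p", 1:5), c("x", "y")))

hc <- hclust(dist(pts), method = "complete")

# Each row of hc$merge is one merge step: negative numbers are single
# observations, positive numbers refer to clusters formed in earlier rows
print(hc$merge)

# Distance at which each merge happened (non-decreasing for complete linkage)
print(round(hc$height, 2))

# Cutting the tree into 3 clusters gives one label per observation
labels3 <- cutree(hc, k = 3)
print(labels3)
```

The two closest points (p1 and p2) are merged first, then p3 and p4, and only later are the larger groups joined. Cutting the tree at k = 3 undoes the last two merges.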

dist_customers <- dist(customers[,3:4])
hc_customers <- hclust(dist_customers, method = 'complete')

Now it’s time to choose the optimal number of clusters. As with the k-means method, I will use the silhouette criterion.

library(cluster)
# Cut the tree into k = 2, ..., 7 clusters and draw one silhouette plot per k
par(mfrow = c(3, 2))
for (k in 2:7) {
  clust_k <- cutree(hc_customers, k = k)
  plot(silhouette(clust_k, dist_customers), col = "blue",
       main = paste("k =", k))
}

It turns out that here, too, we should opt for 5 clusters, as the average silhouette width is largest for k = 5. Now we are ready to plot the dendrogram. To make it more meaningful, I will use colors to indicate which observations fall into which cluster.

library(dendextend)
clust_customers <- cutree(hc_customers, k = 5)
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, k = 5)
par(mfrow = c(1, 1))
plot(dend_colored)

The final step is to create a new data frame with a cluster number assigned to each observation. This way we can create a plot that reveals how the observations were grouped.

library(dplyr)
# Attach the hierarchical labels and drop the k-means labels from the previous section
segment_customers <- mutate(customers, cluster = clust_customers)
segment_customers <- subset(segment_customers, select = -ClusterNumber)
ggplot(segment_customers, aes(x = AnnualIncome, y = SpendingScore, color = factor(cluster))) +
  geom_point() +
  scale_color_discrete(name = 'Cluster number')
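
A visual comparison of two clusterings is qualitative. One simple way to put numbers on it is a cross-tabulation of the two label vectors. The sketch below does this on synthetic data (three blobs) rather than on the mall dataset, and the names toy, km_labels, hc_labels and agreement are hypothetical.

```r
set.seed(1)

# Synthetic data: three Gaussian blobs of 30 points each
toy <- rbind(
  cbind(rnorm(30, 0, 0.3), rnorm(30, 0, 0.3)),
  cbind(rnorm(30, 5, 0.3), rnorm(30, 5, 0.3)),
  cbind(rnorm(30, 0, 0.3), rnorm(30, 5, 0.3))
)

# Labels from both methods for the same number of clusters
km_labels <- kmeans(toy, centers = 3, nstart = 20)$cluster
hc_labels <- cutree(hclust(dist(toy), method = "complete"), k = 3)

# Rows: k-means clusters, columns: hierarchical clusters.
# Cluster numbers are arbitrary, so agreement shows up as one dominant
# cell per row, not necessarily on the diagonal
agreement <- table(kmeans = km_labels, hclust = hc_labels)
print(agreement)

# Share of observations that fall in the dominant cell of their row
print(sum(apply(agreement, 1, max)) / sum(agreement))
```

Applied to the two label vectors from this analysis, the same table would show exactly how many observations moved between groups.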

The resulting graph is very similar to the one we obtained with k-means clustering. Apart from a couple of observations that now fall into the group with medium spending score and annual income, all the others are in the same groups as in the previous clustering. Finally, we can find out what characteristics each of the groups has by exploring the ‘segment_customers’ data frame.

segment_customers %>% group_by(cluster, Gender) %>%
  summarise_all(list(mean)) %>% arrange(cluster)
## # A tibble: 10 x 5
## # Groups:   cluster [5]
##    cluster Gender   Age AnnualIncome SpendingScore
##      <int> <fct>  <dbl>        <dbl>         <dbl>
##  1       1 Female  43.2         27.4          21.7
##  2       1 Male    48.3         24.7          19.7
##  3       2 Female  25.6         24.6          81.8
##  4       2 Male    25           25.8          77.7
##  5       3 Female  40.5         55.8          48.6
##  6       3 Male    45.4         55.9          49.9
##  7       4 Female  32.2         86.0          81.7
##  8       4 Male    33.3         87.1          82.7
##  9       5 Female  43.8         93.3          20.6
## 10       5 Male    38.8         86.4          11.7

In the first cluster we have middle-aged women and men whose annual income and spending scores are both low. In the second group we have young women and men who, despite not having much income, tend to spend a lot. The third group is the most numerous one, and it sat right in the middle of the presented plots. This cluster consists of women and men in their forties who earn medium-sized wages and have moderate spending habits. In group number 4, there are mostly people in their early 30s who earn a lot and also tend to spend a lot. In the last cluster (number 5), we can see women whose average age is around 44 and men whose average age is around 39. Like those in group number 4, these people have high annual incomes but, in contrast, they do not like to spend much.

All in all, this table shows how much important information we can get from a clustering analysis. This paper analysed only a very basic, two-dimensional example, which would not be the scenario in most business use cases. Nevertheless, I believe that these methods, together with some additional analysis, can be successfully applied to real business problems.