Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits.
Benefits of customer segmentation include:
Personalisation
Personalisation ensures that you provide exceptional customer experience.
Customer Retention
It is 16 times as costly to build a long-term business relationship with a new customer than simply to cultivate the loyalty of an existing customer.
Better ROI for marketing
Affirmations that right marketing messages are sent to the right people based on their life cycle stage.
Reveal new opportunities
Customer segmentation may reveal new trends about products and it may even give the first mover’s advantage in a product segment.
Unsupervised Learning is a class of Machine Learning techniques to find the patterns in data. The data given to unsupervised algorithm are not labelled, which means only the input variables(X) are given with no corresponding output variables. In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the data.
The following code takes advantage of the Mall Customer Segmentation Data to demonstrate the ability of K-Means clustering algorithm to identify customer’s segments.
Importing packages
library(ggplot2); theme_set(theme_bw())
library(dplyr)
library(skmeans)
library(gridExtra)
The file is read-
customers <- read.csv("Mall_Customers.csv")
attach(customers)
Before we proceed, we have to make sure that the data is in the correct format so as to be used for modelling later on
## CHECK THE DIMENSIONS OF THE DATA
dim(customers)
## [1] 200 5
str(customers)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 1 ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
The data is just a mere sample of a larger amount of customers data. However, the purpose of this document is the familirisation of K-Means and unsupervised learning
Let us have a glimpe of the data before we dive in more depth.
## TAKE A LOOK OF THE DATA
head(customers)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
colnames(customers)
## [1] "CustomerID" "Gender"
## [3] "Age" "Annual.Income..k.."
## [5] "Spending.Score..1.100."
The data consists of
## UNIQUE CUSTOMERS
n_distinct(CustomerID)
## [1] 200
Individual customers, where the majority pertains to women.
## GENDERS DISTRIBUTION
as.data.frame(table(Gender)) %>%
ggplot(aes(x = Gender, y = Freq)) +
geom_bar(stat = "identity", fill = "#F8766D") +
geom_text(y = as.vector(table(Gender)), label = paste0((as.vector(table(Gender))/sum(as.vector(table(Gender))))*100, "%"))
From a business perspective, most the companies that appear to have a successful story are extremely focuced on a particular target group so as to provide the best experience for them. Hence, businesses are primarly focused on such relevant activities. In addition, occasionally, a business may select more than one segment as the focus of its activities, in which case, it would normally identify a primary target and a secondary target. Primary target markets are those market segments to which marketing efforts are primarily directed and where more of the business’s resources are allocated, while secondary markets are often smaller segments or less vital to a product’s success.
The Age variable would be a good indicator of the targeted Age groups
## SUMMARY OF THE AGE VARIABLE
summary(Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.75 36.00 38.85 49.00 70.00
ggplot(as.data.frame(Age), aes(y = Age)) + geom_boxplot(fill='#F8766D')
## AGE VARIABLE DISTRIBUTION
ggplot(customers, aes( x = Age, fill = Gender)) + geom_density(alpha = 0.4)
It is quite interesting that there is a difference between the two genders. It appears that in both groups (i.e. Males & Females) there is a strong activity at the ages 25-35, while the data shows another frequent group from the female part at the age of around 45 years old. In contrast, the group of men curve declines as the age reaches the maximum age of 70.
The income variable and the spending score variables are the ones that most interests us since we are going to keep this variables to perform our clustering. Investigating a little bit more on this variables we can see that they behave normally and not any anomalies are detected (1 data point at the Income variable is considered to be a correct value).
## ANNUAL INCOME AND SPENDING SCORE BOXPLOTS
p1 <- ggplot(as.data.frame(Annual.Income..k..), aes(y = Annual.Income..k..)) + geom_boxplot(fill='#F8766D') + ylim(c(1,150))
p2 <- ggplot(as.data.frame(Spending.Score..1.100.), aes(y = Spending.Score..1.100.)) + geom_boxplot(fill='#00BFC4') + ylim(c(1,150))
grid.arrange(p1, p2, ncol = 2)
FInally, the dataset does not contain any N/A values so no missing values techniques need to be applied for that reason. Our model is really sensitive in N/A values as will be discussed next.
## N/A VALUES
cat("There are", sum(is.na(customers)), "N/A values.")
## There are 0 N/A values.
K-means clustering is an unsupervised machine learning algorithm for clustering ‘n’ observations into ‘k’ clusters where k is predefined or user-defined constant. The main idea is to define k centroids, one for each cluster.
The K-Means algorithm involves:
## SAVE THE VARIABLES OF INTEREST
Kdata <- customers[,c(4,5)]
Determine the number of clusters using the Elbow approach
## K estimation using Elbow approach
tot.withinss <- vector("numeric", length = 10)
for (i in 1:10){
kDet <- kmeans(Kdata, i)
tot.withinss[i] <- kDet$tot.withinss
}
ggplot(as.data.frame(tot.withinss), aes(x = seq(1,10), y = tot.withinss)) +
geom_point(col = "#F8766D") +
geom_line(col = "#F8766D") +
theme(axis.title.x.bottom = element_blank()) +
ylab("Within-cluster Sum of Squares") +
xlab("Number of Clusters") +
ggtitle("Elbow K Estimation")
It can be seen from the graph above that a reasonable selection for the K value would be the k = 5. Hence, we are going to create 5 clusters to generate our segments.
## CLUSTER THE DATA
customerClusters <- kmeans(Kdata, 5)
customerClusters
## K-means clustering with 5 clusters of sizes 100, 38, 22, 10, 30
##
## Cluster means:
## Annual.Income..k.. Spending.Score..1.100.
## 1 47.96000 43.25000
## 2 87.00000 18.63158
## 3 25.72727 79.36364
## 4 109.70000 82.00000
## 5 78.23333 81.36667
##
## Clustering vector:
## [1] 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1
## [36] 3 1 3 1 3 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5
## [141] 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2 5 2
## [176] 5 2 5 2 5 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
##
## Within cluster sum of squares by cluster:
## [1] 42058.590 14204.842 3519.455 2512.100 4370.333
## (between_SS / total_SS = 75.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
## VISULIASE THE CLUSTERS
ggplot(Kdata, aes(x = Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(stat = "identity", aes(color = as.factor(customerClusters$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")) +
ggtitle("Mall Customer Segmens", subtitle = "K-means Clustering")
Result
The generated ‘Clusters of Customers’ plot shows the distribution of the 5 clusters. A sensible interpretation for the mall customer segments can be:
Cluster 1. Customers with medium annual income and medium annual spend
Cluster 2. Customers with high annual income and high annual spend
Cluster 3. Customers with low annual income and low annual spend
Cluster 4. Customers with high annual income but low annual spend
Cluster 5. Customers low annual income but high annual spend
Having a better understanding of the customers segments, a company could make better and more informed decisions. An example, there are customers with high annual income but low spending score. A more strategic and targeted marketing approach could lift their interest and make them become higher spenders. The focus should also be on the “loyal” customers and maintain their satisfaction.
We have thus seen, how we could arrive at meaningful insights and recommendations by using clustering algorithms to generate customer segments. For the sake of simplicity, the dataset used only 2 variables - income and spend. In a typical business scenario, there could be several variables which could possibly generate much more realistic and business-specific insights.