Developing Personas and segments is a critical step in creating a personalization strategy for a business.
Customer segmentation groups customers based on distinct characteristics. Segments are generally developed through big-data analysis and are defined using demographic information such as age, income, and location, or behavioral information such as interests, opinions, values, lifestyle, risk aversion, or life stage. Customer segmentation is one of the most important applications of unsupervised learning. Clustering techniques help companies develop a better understanding of the needs and purchase motivations of customers in different segments of the market, which makes it easier to offer products and services tailored to each group's preferences and needs.
A Persona is a semi-fictional representation of a customer segment. Segments on their own don't provide insight into individual customers; that is why we use Personas, which bring your customers to life.
In this analysis, the goal is a value-based segmentation of customers according to their annual spending, in monetary units (m.u.), on diverse product categories. List of variables in the dataset:
- Channel: sales channel (1 = Hotel/Restaurant/Cafe, 2 = Retail)
- Region: customer region (1 = Lisbon, 2 = Oporto, 3 = Other)
- Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen: annual spending (m.u.) on the corresponding product category
Libraries required:
library(ggplot2)
library(DataExplorer)
library(dplyr)
library(gridExtra)
library(corrplot)
library(clustertend)
library(cluster)
library(factoextra)
data <- read.csv("Wholesale customers data.csv")
head(data)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
str(data)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
We can see below that this dataset doesn't have any missing values.
options(repr.plot.width=8, repr.plot.height=3)
# look for missing values using the DataExplorer package
plot_missing(data,
geom_label_args = list("size" = 3, "label.padding" = unit(0.1, "lines")),
ggtheme = theme_minimal())
First, we need to transform the Channel and Region variables into factors:
data$Channel<-as.factor(data$Channel)
data$Region<-as.factor(data$Region)
Second, we need to make the categories readable. The Channel and Region factors are recoded to these levels:
data <- data %>% mutate(Channel=recode(Channel,
'1'= "Hotel/Restaurant/Cafe",
'2'= "Retail"))
data <- data %>% mutate(Region=recode(Region,
'1'= "Lisbon",
'2'= "Oporto",
'3'= "Other"))
head(data)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper
## 1 Retail Other 12669 9656 7561 214 2674
## 2 Retail Other 7057 9810 9568 1762 3293
## 3 Retail Other 6353 8808 7684 2405 3516
## 4 Hotel/Restaurant/Cafe Other 13265 1196 4221 6404 507
## 5 Retail Other 22615 5410 7198 3915 1777
## 6 Retail Other 9413 8259 5126 666 1795
## Delicassen
## 1 1338
## 2 1776
## 3 7844
## 4 1788
## 5 5185
## 6 1451
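With the labels in place, an optional quick cross-tabulation shows how the two sales channels distribute across the three regions:
# customers per channel and region
table(data$Channel, data$Region)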
Next, we check the annual spending variables (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) for negative or zero records; it is important to identify any such records so they can be removed from the dataset. In this case there are none.
negative_values <- function(x) {100 * sum(x <= 0) / length(x)} # % of non-positive records per variable
negative <- sapply(data %>% select(Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen), negative_values)
negative
## Fresh Milk Grocery Frozen
## 0 0 0 0
## Detergents_Paper Delicassen
## 0 0
summary(data)
## Channel Region Fresh Milk
## Hotel/Restaurant/Cafe:298 Lisbon: 77 Min. : 3 Min. : 55
## Retail :142 Oporto: 47 1st Qu.: 3128 1st Qu.: 1533
## Other :316 Median : 8504 Median : 3627
## Mean : 12000 Mean : 5796
## 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
graph1 <- data %>%
group_by(Channel) %>%
dplyr::summarise(count = n()) %>%
ggplot(aes(x = Channel, y = count)) +
geom_col(fill = "#5dc1b9") +
coord_flip() +
ggtitle("Customers Channel", "Total Customers by Channel") +
geom_label(aes(x = Channel, y = count, label = count)) +
labs(x = "Channel", y = "Total customers") +
theme_minimal()
graph2 <- data %>%
group_by(Region) %>%
dplyr::summarise(count = n()) %>%
ggplot(aes(x = Region, y = count)) +
geom_col(fill = "#5dc1b9") +
coord_flip() +
ggtitle("Customers Region", "Total Customers by Region") +
geom_label(aes(x = Region, y = count, label = count)) +
labs(x = "Region", y = "Total customers") +
theme_minimal()
graph1
graph2
From the graphs above it can be seen that most customers (316 out of 440, about 72%) belong to the Other region, and that 68% of customers (298 out of 440) buy through the Hotel/Restaurant/Cafe channel.
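As a quick check, these shares can be computed directly:
round(100 * prop.table(table(data$Channel)), 1)  # % of customers per channel
round(100 * prop.table(table(data$Region)), 1)   # % of customers per region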
Next, we look at the distribution of each spending variable:
p1 <- ggplot(data, aes(x=Fresh, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Fresh distribution", subtitle = "Histogram Chart",
x = "Fresh", y = "Frequency")
p2 <- ggplot(data, aes(x=Milk, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Milk distribution", subtitle = "Histogram Chart",
x = "Milk", y = "Frequency")
p3 <- ggplot(data, aes(x=Grocery, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Grocery distribution", subtitle = "Histogram Chart",
x = "Grocery", y = "Frequency")
p4 <- ggplot(data, aes(x=Frozen, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Frozen distribution", subtitle = "Histogram Chart",
x = "Frozen", y = "Frequency")
p5 <- ggplot(data, aes(x=Detergents_Paper, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Detergents Paper distribution", subtitle = "Histogram Chart",
x = "Detergents Paper", y = "Frequency")
p6 <- ggplot(data, aes(x=Delicassen, y=..count..)) +
geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
labs(title="Delicassen distribution", subtitle = "Histogram Chart",
x = "Delicassen", y = "Frequency")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol = 2)
data_corr <- data %>%
dplyr::select(Delicassen, Detergents_Paper, Frozen, Grocery, Milk, Fresh)
var_correlation <-cor(data_corr, use="pairwise.complete.obs")
corrplot(var_correlation, method="color",
addCoef.col = "black", number.cex = 0.5, tl.cex=0.8, tl.srt=70,tl.col="black" )
The most correlated variables are Grocery and Detergents_Paper (r ≈ 0.92), followed by Grocery and Milk (r ≈ 0.73) and Milk and Detergents_Paper (r ≈ 0.66).
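These pairs can be read off the correlation plot, or ranked programmatically (a small sketch reusing the var_correlation matrix computed above):
# rank variable pairs by absolute correlation
cor_pairs <- as.data.frame(as.table(var_correlation)) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%  # drop diagonal and duplicate pairs
  arrange(desc(abs(Freq)))
head(cor_pairs, 3)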
The most popular ways of determining the number of clusters are the Elbow method, the Silhouette method, and the Gap statistic. Elbow and Silhouette are direct methods, while the Gap statistic is a statistical testing method.
Standardizing the data is recommended because otherwise the range of values in each feature would act as a weight when determining how to cluster the data, which is typically undesired. For this reason, the data is scaled/standardized:
data2 <- data[,-c(1,2)] # remove categorical columns 1 and 2
scaled_data <- scale(data2)
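As an optional sanity check, each scaled column should now have mean 0 and standard deviation 1:
round(colMeans(scaled_data), 3)      # means should be ~0
round(apply(scaled_data, 2, sd), 3)  # standard deviations should be 1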
The results suggest that 5 is the optimal number of clusters, as that is where the bend (elbow) appears in the curve.
set.seed(123)
fviz_nbclust(scaled_data , kmeans, method = "wss")
The results show that 2 clusters maximize the average silhouette width, with 4 clusters coming in as the second-best option.
set.seed(123)
fviz_nbclust(scaled_data , kmeans, method = "silhouette")
The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. In this case, the gap statistic suggests that the optimal number of clusters is 3.
set.seed(123)
fviz_nbclust(scaled_data, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
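fviz_nbclust delegates this computation to cluster::clusGap, which can also be called directly (a sketch using the same B = 50 bootstrap samples):
set.seed(123)
gap_stat <- clusGap(scaled_data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
print(gap_stat, method = "firstmax")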
These are the visualizations for 2, 3, and 5 clusters. The function used to picture the results is fviz_cluster. This function visualizes the clusters in two dimensions, performing Principal Component Analysis (PCA) behind the scenes to reduce the dimensionality so that the data can be represented by clusters in a 2-D space.
set.seed(123)
k2 <- kmeans(scaled_data, centers = 2, nstart = 25) # centers = number of clusters; nstart = number of random initial sets to try
k3 <- kmeans(scaled_data, centers = 3, nstart = 25)
k5 <- kmeans(scaled_data, centers = 5, nstart = 25)
# plots to compare
p7 <- fviz_cluster(k2, geom = "point", data = scaled_data) + ggtitle("2 Clusters")
p8 <- fviz_cluster(k3, geom = "point", data = scaled_data) + ggtitle("3 Clusters")
p9 <- fviz_cluster(k5, geom = "point", data = scaled_data) + ggtitle("5 Clusters")
grid.arrange(p7, p8, p9, nrow = 2)
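Because fviz_cluster projects the data onto the first two principal components, it is worth checking how much of the total variance those two components actually capture (a quick sketch with prcomp):
pca <- prcomp(scaled_data)      # data is already centered and scaled
summary(pca)$importance[, 1:2]  # proportion of variance in PC1 and PC2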
The Elbow method suggests that 5 is the optimal number of clusters. The Silhouette method suggests that 2 clusters maximize the average silhouette width, with 4 clusters as the second-best option. The gap statistic suggests 3 clusters. In this step, the candidate solutions are analyzed in detail.
set.seed(123)
print(k2)
## K-means clustering with 2 clusters of sizes 41, 399
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 0.05283636 2.0659269 2.2407190 0.32219794 2.2585338 0.8039597
## 2 -0.00542930 -0.2122882 -0.2302493 -0.03310806 -0.2320799 -0.0826124
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [75] 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [149] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2
## [334] 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
##
## Within cluster sum of squares by cluster:
## [1] 966.3860 982.9619
## (between_SS / total_SS = 26.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
print(k3)
## K-means clustering with 3 clusters of sizes 13, 322, 105
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 1.2628701 3.8420545 3.4733327 1.70013636 3.2968964 2.37006056
## 2 0.1121903 -0.3514030 -0.4256645 0.04403387 -0.4182375 -0.12285483
## 3 -0.5004055 0.6019528 0.8753395 -0.34553027 0.8744079 0.08331873
##
## Clustering vector:
## [1] 2 3 3 2 2 2 2 2 2 3 3 2 3 3 3 2 3 2 2 2 2 2 2 1 3 2 2 2 3 2 2 2 2 2 2 3 2
## [38] 3 3 2 2 2 3 3 3 3 3 1 3 3 2 2 2 3 2 2 3 3 2 2 2 1 2 3 2 1 2 3 2 2 2 3 2 2
## [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 2 2 2 2 2 1 2 3 2 2 2 2 2 3 3 3 2 2 2 3 3 2 3 2
## [112] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 2 2 2 2 2 1 3 1 2
## [186] 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [297] 2 2 3 2 2 3 3 3 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 1 2 2 2 2 2 2 3 3 3 3 2 2 3 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2
## [371] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 3 2 2 2 3 3 3 2 3 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 693.5608 677.2089 239.5571
## (between_SS / total_SS = 38.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
print(k5)
## K-means clustering with 5 clusters of sizes 10, 269, 97, 1, 63
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 0.3134735 3.9174467 4.2707490 -0.003570131 4.6129149 0.5027930
## 2 -0.2281097 -0.3850613 -0.4383243 -0.163797758 -0.3991069 -0.1945037
## 3 -0.4962279 0.6810009 0.9032545 -0.332321693 0.8994410 0.1018261
## 4 1.9645810 5.1696185 1.2857533 6.892753825 -0.5542311 16.4597113
## 5 1.6570840 -0.1082488 -0.2174555 1.102218231 -0.4041420 0.3326463
##
## Clustering vector:
## [1] 2 3 3 2 5 2 2 2 2 3 3 2 5 3 3 2 3 2 2 2 2 2 5 3 3 2 2 2 3 5 2 2 2 5 2 3 5
## [38] 3 3 5 5 2 3 3 3 3 3 1 3 3 2 2 5 3 2 2 1 3 2 2 2 1 2 3 2 1 2 3 2 2 5 5 2 5
## [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 5 2 5 2 2 1 5 3 2 2 2 2 2 3 3 2 5 2 2 3 3 2 3 2
## [112] 3 5 2 2 2 2 2 2 2 2 2 2 2 5 5 5 2 2 5 2 2 2 2 2 2 2 2 2 2 2 5 5 2 2 3 2 2
## [149] 2 5 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 5 2 2 2 2 5 3 4 2
## [186] 2 2 2 3 3 2 2 2 3 2 5 5 3 2 2 3 3 5 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 2 2 2 3 3 2 2 2 2 2 1 2 5 3 5 2 2 5
## [260] 5 2 2 2 2 3 3 3 2 3 2 2 2 2 5 2 2 5 5 2 2 2 2 5 5 5 5 2 2 2 5 2 2 2 3 2 2
## [297] 2 2 2 2 2 3 3 3 3 3 3 2 2 3 2 5 3 2 2 3 2 2 2 3 2 2 2 2 2 5 2 2 2 2 2 3 2
## [334] 1 5 5 2 2 2 2 3 3 2 3 2 2 3 5 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 5 5 2 2 2 2 3 5 2 2 5 5 5 2 3 2 2 2 2 2 2 2 2 5 2 2 3 2 2 2 2 5 2 2 2 2 5
## [408] 3 2 2 2 2 2 5 2 2 3 2 3 2 3 2 2 2 2 5 3 5 2 2 2 5 2 2 2 5 5 3 2 2
##
## Within cluster sum of squares by cluster:
## [1] 149.4481 235.0199 231.7329 0.0000 440.1481
## (between_SS / total_SS = 59.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
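Before profiling the segments, the (between_SS / total_SS) ratios printed above can be collected into one comparison; a higher ratio means the clustering explains a larger share of the total variance:
# share of total variance explained by each clustering, in %
sapply(list(k2 = k2, k3 = k3, k5 = k5),
       function(m) round(100 * m$betweenss / m$totss, 1))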
# re-fit k-means on the unscaled spending data so that the resulting
# centers are expressed in the original monetary units (m.u.)
center.k3 <- kmeans(data2, centers = 3)$centers
center.k5 <- kmeans(data2, centers = 5)$centers
# column percentages: each cluster's share of the summed center spending per product category
round(prop.table(center.k3, 2) * 100)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 16 13 13 23 12 18
## 2 15 65 70 18 82 35
## 3 69 21 16 60 7 47
round(prop.table(center.k5, 2) * 100)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 21 6 6 18 3 11
## 2 48 8 7 37 3 16
## 3 21 61 59 28 66 56
## 4 6 5 5 11 4 6
## 5 5 19 23 7 24 10
Cluster 3 - Segmentation: reading the first table, cluster 2 concentrates the spending on Milk (65%), Grocery (70%), and Detergents_Paper (82%); cluster 3 concentrates on Fresh (69%), Frozen (60%), and Delicassen (47%); cluster 1 spends modestly across all categories.
Cluster 5 - Segmentation: in the second table the same two profiles reappear (cluster 3 grocery-focused, cluster 2 fresh/frozen-focused), with the remaining clusters splitting the lower-spending customers along the same fresh-versus-grocery axis.
Based on the analysis performed, these clusters form the customer segmentation of the wholesale service according to spending habits, and they are the foundation for its Personas.
Customer segmentation is an important marketing strategy that organizations should deploy for their products and services. It involves “dividing a large, heterogeneous market into smaller segments of consumers with distinct needs, characteristics, or behaviors that might require separate strategies”.
This segmentation, in turn, allows us to build the Personas: a Persona represents a consumer group or cluster that shares similar values, behaviors, and goals. Personas begin as basic profiles, but are then given names, faces, and personalities. They help us understand the emotional and behavioral triggers behind individual consumers.
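As a practical first step toward profiling each Persona, the cluster labels can be attached back to the original data and the spending summarized per segment (a minimal sketch using the k5 model fitted above; assumes dplyr >= 1.0 for across() and that the row order of data is unchanged since clustering):
data$Cluster <- as.factor(k5$cluster)  # label each customer with its cluster
data %>%
  group_by(Cluster) %>%
  dplyr::summarise(across(Fresh:Delicassen, mean), customers = n())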
Having a solid understanding at this point is key to delivering a breakthrough customer experience, and k-means clustering is a simple but powerful machine learning algorithm that gets us there.