1. Overview

Developing Personas and Segments is a critical step in creating a personalization strategy for a business.

Customer Segmentation groups customers based on distinct characteristics. Segments are generally developed through big-data analysis and are defined using demographic information such as age, income, and location, or behavioral information such as interests, opinions, values, lifestyle, risk aversion, or life stage. Customer Segmentation is one of the most important applications of unsupervised learning. Using clustering techniques helps companies develop a better understanding of the needs and purchase motivations of customers in different segments of the market, which makes it easier to offer products and services tailored to each group's preferences and needs.

A Persona is a semi-fictional representation of a customer segment. Segments alone don't provide much insight into individual customers; for this reason we use Personas, which bring your customers to life.

In this analysis, the goal is to achieve a value-based segmentation of customers based on their annual spending, in monetary units (m.u.), on diverse product categories.

2. Data

2.1 Data Information

List of variables in the dataset:

  • FRESH: annual spending (m.u.) on fresh products (Continuous)
  • MILK: annual spending (m.u.) on milk products (Continuous)
  • GROCERY: annual spending (m.u.) on grocery products (Continuous)
  • FROZEN: annual spending (m.u.) on frozen products (Continuous)
  • DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  • DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  • CHANNEL: customer's channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
  • REGION: customer's region - Lisbon, Oporto, or Other (Nominal)

Libraries required:

library(ggplot2)       # plotting
library(DataExplorer)  # automated EDA, missing-value profiling
library(dplyr)         # data manipulation
library(gridExtra)     # arranging multiple plots in a grid
library(corrplot)      # correlation plots
library(clustertend)   # cluster tendency statistics
library(cluster)       # clustering algorithms and utilities
library(factoextra)    # cluster visualization and diagnostics

2.2 Data Loading

data <- read.csv("Wholesale customers data.csv")
head(data)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451
str(data)
## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

2.3 Data Preprocessing

2.3.1 Missing Values

We can see that this dataset doesn't have any missing values:

options(repr.plot.width=8, repr.plot.height=3)
# look for missing values using the DataExplorer package
plot_missing(data, 
             geom_label_args = list("size" = 3, "label.padding" = unit(0.1, "lines")),
             ggtheme = theme_minimal())

2.3.2 Data Transformation

First, we need to transform the Channel and Region variables to factors:

data$Channel<-as.factor(data$Channel)
data$Region<-as.factor(data$Region)

Secondly, we need to correct the categorization. The Channel and Region variables will be recoded to these levels:

  • The Channel variable has two values: "1" refers to "Hotel/Restaurant/Cafe" and "2" refers to "Retail"
  • The Region variable has three values: "1" = "Lisbon", "2" = "Oporto", "3" = "Other"

data <- data %>% mutate(Channel=recode(Channel,
                '1'= "Hotel/Restaurant/Cafe",
                '2'= "Retail")) 

data <- data %>% mutate(Region=recode(Region,
                '1'= "Lisbon",
                '2'= "Oporto",
                '3'= "Other"))
head(data)
##                 Channel Region Fresh Milk Grocery Frozen Detergents_Paper
## 1                Retail  Other 12669 9656    7561    214             2674
## 2                Retail  Other  7057 9810    9568   1762             3293
## 3                Retail  Other  6353 8808    7684   2405             3516
## 4 Hotel/Restaurant/Cafe  Other 13265 1196    4221   6404              507
## 5                Retail  Other 22615 5410    7198   3915             1777
## 6                Retail  Other  9413 8259    5126    666             1795
##   Delicassen
## 1       1338
## 2       1776
## 3       7844
## 4       1788
## 5       5185
## 6       1451

2.3.3 Negative Values

In this part, we measure the share of non-positive values in the annual spending variables Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen. It is important to check whether the dataset contains any negative records, so that they can be removed.

In this case, there are no negative records.

# percentage of non-positive records in a vector
negative_values <- function(x) {100*sum(x <= 0) / length(x)}

negative <- sapply(data %>% select(Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen), negative_values)

negative
##            Fresh             Milk          Grocery           Frozen 
##                0                0                0                0 
## Detergents_Paper       Delicassen 
##                0                0

3. Exploratory Data Analysis

3.1 Data Summary

summary(data)
##                   Channel       Region        Fresh             Milk      
##  Hotel/Restaurant/Cafe:298   Lisbon: 77   Min.   :     3   Min.   :   55  
##  Retail               :142   Oporto: 47   1st Qu.:  3128   1st Qu.: 1533  
##                              Other :316   Median :  8504   Median : 3627  
##                                           Mean   : 12000   Mean   : 5796  
##                                           3rd Qu.: 16934   3rd Qu.: 7190  
##                                           Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2153   1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 4756   Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 7951   Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.:10656   3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :92780   Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

3.2 Bar Plots by Channel and Region

graph1 <- data %>%
  group_by(Channel)  %>%
  dplyr::summarise(count = n()) %>%
  ggplot(aes(x = Channel, y = count)) +
  geom_col(fill = "#5dc1b9") +
  coord_flip() +
  ggtitle("Customers Channel", "Total Customers by Channel") +
  geom_label(aes(x = Channel, y = count, label = count)) +
  labs(x = "Channel", y = "Total customers") +
  theme_minimal() 

graph2 <- data %>%
  group_by(Region)  %>%
  dplyr::summarise(count = n()) %>%
  ggplot(aes(x = Region, y = count)) +
  geom_col(fill = "#5dc1b9") +
  coord_flip() +
  ggtitle("Customers Region", "Total Customers by Region") +
  geom_label(aes(x = Region, y = count, label = count)) +
  labs(x = "Region", y = "Total customers") +
  theme_minimal() 
  
graph1

graph2

From the graphs above, it can be seen that the vast majority of customers (316 out of 440) belong to the Other region, and that 68% of customers (298 out of 440) come through the Hotel/Restaurant/Café channel.

3.3 Distribution of Single Variables

p1 <- ggplot(data, aes(x=Fresh, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Fresh distribution",  subtitle = "Histogram Chart",
       x = "Fresh", y = "Frequency") 

p2 <- ggplot(data, aes(x=Milk, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Milk distribution",  subtitle = "Histogram Chart",
       x = "Milk", y = "Frequency") 

p3 <- ggplot(data, aes(x=Grocery, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Grocery distribution",  subtitle = "Histogram Chart",
       x = "Grocery", y = "Frequency") 

p4 <- ggplot(data, aes(x=Frozen, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Frozen distribution",  subtitle = "Histogram Chart",
       x = "Frozen", y = "Frequency") 

p5 <- ggplot(data, aes(x=Detergents_Paper, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Detergents Paper distribution",  subtitle = "Histogram Chart",
       x = "Detergents Paper", y = "Frequency") 

p6 <- ggplot(data, aes(x=Delicassen, y=..count..)) + 
  geom_histogram(position="dodge", fill="#5dc1b9",bins=30) +
     labs(title="Delicassen distribution",  subtitle = "Histogram Chart",
       x = "Delicassen", y = "Frequency") 


grid.arrange(p1,p2,p3,p4,p5,p6, ncol = 2)

3.4 Correlation

data_corr <- data %>%
  dplyr::select(Delicassen, Detergents_Paper, Frozen, Grocery, Milk, Fresh)

var_correlation <-cor(data_corr, use="pairwise.complete.obs")
corrplot(var_correlation, method="color", 
         addCoef.col = "black", number.cex = 0.5,  tl.cex=0.8, tl.srt=70,tl.col="black" )

The most strongly correlated pairs are:

  • Detergents_Paper and Grocery (0.92)
  • Milk and Grocery (0.73)
  • Milk and Detergents_Paper (0.66)

4. Data Clustering

Given the size of the dataset, k-means clustering is used here rather than alternatives such as a hierarchical clustering model. K-means is the most commonly used unsupervised machine learning algorithm for partitioning data into a set of k groups or clusters. The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable k. The algorithm works iteratively to assign each data point to one of the k groups based on the features provided, so that points are clustered by feature similarity.
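
To make the mechanics concrete, below is a minimal sketch of the assignment/update loop that k-means performs. It is illustrative only: the function name simple_kmeans is ours, the analysis itself relies on R's built-in kmeans(), and this sketch handles neither empty clusters nor multiple random starts.

# A minimal sketch of the two-step (assign/update) iteration behind k-means.
simple_kmeans <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # Assignment step: attach each point to its nearest centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    cluster <- apply(d, 1, which.min)
    # Update step: move each centroid to the mean of its assigned points
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    if (max(abs(new_centers - centers)) < 1e-8) break  # converged
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}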

4.1 Optimal number of clusters for K-Means

Popular methods for determining the number of clusters are:

  • Elbow Method
  • Silhouette Method
  • Gap Statistic Method

The Elbow and Silhouette methods are direct methods, while the gap statistic is a statistical testing approach.

Standardizing the data is recommended: otherwise, the range of values of each feature acts as an implicit weight when determining how to cluster the data, which is typically undesirable. For this reason, the data is scaled/standardized:

data2 <- data[,-c(1,2)] # remove the categorical columns Channel and Region
scaled_data <- scale(data2) # z-standardize: center each column to mean 0, sd 1
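
As a quick sanity check (optional), each standardized column should now have mean approximately 0 and standard deviation 1:

round(colMeans(scaled_data), 10)  # ~0 for every column
apply(scaled_data, 2, sd)         # 1 for every column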

a) Elbow Method

The results suggest that 5 is the optimal number of clusters, as that is where the bend (the "knee" or "elbow") of the curve appears.

set.seed(123)
fviz_nbclust(scaled_data  , kmeans, method = "wss")
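
The curve drawn by fviz_nbclust with method = "wss" can also be reproduced by hand, which makes explicit what is being plotted (a sketch; the vector name wss is ours):

set.seed(123)
# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(scaled_data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")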

b) Silhouette Method

The results show that 2 clusters maximize the average silhouette width, with 4 clusters coming in as the second-best option.

set.seed(123)
fviz_nbclust(scaled_data , kmeans, method = "silhouette")
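
For reference, the silhouette width of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster. Values close to 1 indicate well-separated clusters, and the k that maximizes the average s(i) is preferred.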

c) Gap Analysis

The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. In this case, the gap statistic suggests that the optimal number of clusters is 3.
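
For reference, the gap statistic (Tibshirani et al., 2001) is defined as Gap_n(k) = E*_n[log(W_k)] - log(W_k), where W_k is the pooled within-cluster sum of squares for k clusters and E*_n denotes its expectation under the reference distribution, estimated by bootstrapping (the nboot argument below). A larger gap indicates a clustering structure far from a random, uniform distribution of points.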

set.seed(123)
fviz_nbclust(scaled_data, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")

4.2 Cluster Representation

These are the visualizations for 2, 3, and 5 clusters. The function used to picture the results is fviz_cluster. It visualizes the clusters in two dimensions, performing Principal Component Analysis (PCA) behind the scenes to reduce the dimensionality so that the data can be represented in a 2-D space.

set.seed(123)

k2 <- kmeans(scaled_data, centers = 2, nstart = 25)  # centers = number of clusters; nstart = number of random initial configurations to try
k3 <- kmeans(scaled_data, centers = 3, nstart = 25)
k5 <- kmeans(scaled_data, centers = 5, nstart = 25)

# plots to compare
p7 <- fviz_cluster(k2, geom = "point", data = scaled_data) + ggtitle("2 Clusters")
p8 <- fviz_cluster(k3, geom = "point",  data = scaled_data) + ggtitle("3 Clusters")
p9 <- fviz_cluster(k5, geom = "point",  data = scaled_data) + ggtitle("5 Clusters")

grid.arrange(p7, p8, p9, nrow = 2)
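
As a cross-check on the PCA statement above, the same two-dimensional coordinates can be reproduced manually with prcomp() (a sketch; the object names pca and proj are ours). The resulting scatter should closely match the Dim1/Dim2 axes that fviz_cluster draws:

# project the scaled data onto its first two principal components
pca  <- prcomp(scaled_data)   # data is already centered and scaled
proj <- data.frame(pca$x[, 1:2], cluster = factor(k3$cluster))
ggplot(proj, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point() +
  ggtitle("3 clusters on the first two principal components") +
  theme_minimal()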

4.3 Final number of clusters

The Elbow method suggests that 5 is the optimal number of clusters. The Silhouette method shows that 2 clusters maximize the average silhouette width, with 4 clusters coming in as the second-best option. The gap statistic suggests 3 clusters. In this step, the candidate solutions are analyzed in detail.

set.seed(123)
print(k2)
## K-means clustering with 2 clusters of sizes 41, 399
## 
## Cluster means:
##         Fresh       Milk    Grocery      Frozen Detergents_Paper Delicassen
## 1  0.05283636  2.0659269  2.2407190  0.32219794        2.2585338  0.8039597
## 2 -0.00542930 -0.2122882 -0.2302493 -0.03310806       -0.2320799 -0.0826124
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2
##  [75] 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [149] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2
## [334] 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 966.3860 982.9619
##  (between_SS / total_SS =  26.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
print(k3)
## K-means clustering with 3 clusters of sizes 13, 322, 105
## 
## Cluster means:
##        Fresh       Milk    Grocery      Frozen Detergents_Paper  Delicassen
## 1  1.2628701  3.8420545  3.4733327  1.70013636        3.2968964  2.37006056
## 2  0.1121903 -0.3514030 -0.4256645  0.04403387       -0.4182375 -0.12285483
## 3 -0.5004055  0.6019528  0.8753395 -0.34553027        0.8744079  0.08331873
## 
## Clustering vector:
##   [1] 2 3 3 2 2 2 2 2 2 3 3 2 3 3 3 2 3 2 2 2 2 2 2 1 3 2 2 2 3 2 2 2 2 2 2 3 2
##  [38] 3 3 2 2 2 3 3 3 3 3 1 3 3 2 2 2 3 2 2 3 3 2 2 2 1 2 3 2 1 2 3 2 2 2 3 2 2
##  [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 2 2 2 2 2 1 2 3 2 2 2 2 2 3 3 3 2 2 2 3 3 2 3 2
## [112] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [149] 2 2 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 2 2 2 2 2 1 3 1 2
## [186] 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## [297] 2 2 3 2 2 3 3 3 3 3 3 2 2 3 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 3 2
## [334] 1 2 2 2 2 2 2 3 3 3 3 2 2 3 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2
## [371] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [408] 3 2 2 2 2 3 2 2 2 3 3 3 2 3 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 693.5608 677.2089 239.5571
##  (between_SS / total_SS =  38.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
print(k5)
## K-means clustering with 5 clusters of sizes 10, 269, 97, 1, 63
## 
## Cluster means:
##        Fresh       Milk    Grocery       Frozen Detergents_Paper Delicassen
## 1  0.3134735  3.9174467  4.2707490 -0.003570131        4.6129149  0.5027930
## 2 -0.2281097 -0.3850613 -0.4383243 -0.163797758       -0.3991069 -0.1945037
## 3 -0.4962279  0.6810009  0.9032545 -0.332321693        0.8994410  0.1018261
## 4  1.9645810  5.1696185  1.2857533  6.892753825       -0.5542311 16.4597113
## 5  1.6570840 -0.1082488 -0.2174555  1.102218231       -0.4041420  0.3326463
## 
## Clustering vector:
##   [1] 2 3 3 2 5 2 2 2 2 3 3 2 5 3 3 2 3 2 2 2 2 2 5 3 3 2 2 2 3 5 2 2 2 5 2 3 5
##  [38] 3 3 5 5 2 3 3 3 3 3 1 3 3 2 2 5 3 2 2 1 3 2 2 2 1 2 3 2 1 2 3 2 2 5 5 2 5
##  [75] 2 2 2 3 2 2 2 3 3 2 2 1 1 5 2 5 2 2 1 5 3 2 2 2 2 2 3 3 2 5 2 2 3 3 2 3 2
## [112] 3 5 2 2 2 2 2 2 2 2 2 2 2 5 5 5 2 2 5 2 2 2 2 2 2 2 2 2 2 2 5 5 2 2 3 2 2
## [149] 2 5 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 3 3 2 2 2 3 3 2 3 2 3 5 2 2 2 2 5 3 4 2
## [186] 2 2 2 3 3 2 2 2 3 2 5 5 3 2 2 3 3 5 2 2 3 2 2 2 3 2 1 2 2 3 3 3 2 3 2 2 3
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 2 2 2 3 3 2 2 2 2 2 1 2 5 3 5 2 2 5
## [260] 5 2 2 2 2 3 3 3 2 3 2 2 2 2 5 2 2 5 5 2 2 2 2 5 5 5 5 2 2 2 5 2 2 2 3 2 2
## [297] 2 2 2 2 2 3 3 3 3 3 3 2 2 3 2 5 3 2 2 3 2 2 2 3 2 2 2 2 2 5 2 2 2 2 2 3 2
## [334] 1 5 5 2 2 2 2 3 3 2 3 2 2 3 5 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 5 5 2 2 2 2 3 5 2 2 5 5 5 2 3 2 2 2 2 2 2 2 2 5 2 2 3 2 2 2 2 5 2 2 2 2 5
## [408] 3 2 2 2 2 2 5 2 2 3 2 3 2 3 2 2 2 2 5 3 5 2 2 2 5 2 2 2 5 5 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 149.4481 235.0199 231.7329   0.0000 440.1481
##  (between_SS / total_SS =  59.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
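
Before comparing the solutions, note that the between_SS / total_SS ratio printed above can be extracted directly from the kmeans objects (a quick sketch):

# share of total variance captured between clusters (higher = tighter clusters)
sapply(list(k2 = k2, k3 = k3, k5 = k5),
       function(m) round(m$betweenss / m$totss, 3))
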
From the output above, we can see that k = 3 or k = 5 is likely the optimal number of clusters, because the between_SS / total_SS percentage is noticeably higher (38.9% and 59.9%, respectively). So, analyzing the cluster centers in detail as column percentages:
center.k3 <- kmeans(data2, centers = 3)   # note: fitted on the unscaled data
center.k5 <- kmeans(data2, centers = 5)
center.k3 <- center.k3$centers
center.k5 <- center.k5$centers
round(prop.table(center.k3, 2) * 100)  # column-wise percentages: each variable's column sums to 100
##   Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1    16   13      13     23               12         18
## 2    15   65      70     18               82         35
## 3    69   21      16     60                7         47
round(prop.table(center.k5, 2) * 100)
##   Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1    21    6       6     18                3         11
## 2    48    8       7     37                3         16
## 3    21   61      59     28               66         56
## 4     6    5       5     11                4          6
## 5     5   19      23      7               24         10
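
To translate these percentage profiles back into actual spending levels (the basis for the thresholds quoted below), the average annual spending per cluster can also be inspected in the original units (a sketch using the 3-cluster solution k3 fitted on the scaled data):

# average annual spending (m.u.) per cluster, in the original units
aggregate(data2, by = list(cluster = k3$cluster), FUN = mean)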

3-Cluster Segmentation

  • Persona or Customer 1: customers who spend small amounts (less than 25,000 m.u.) across all product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen)
  • Persona or Customer 2: customers who spend large amounts (more than 30,000 m.u.) on fresh-type products (Fresh and Frozen)
  • Persona or Customer 3: customers who spend large amounts (more than 30,000 m.u.) on Milk, Grocery, and Detergents_Paper products

5-Cluster Segmentation

  • Persona or Customer 1: customers who spend small amounts (less than 25,000 m.u.) across all product categories
  • Persona or Customer 2: customers who spend large amounts (more than 30,000 m.u.) on Milk, Grocery, and Detergents_Paper products
  • Persona or Customer 3: customers who spend small amounts (less than 25,000 m.u.) across all product categories
  • Persona or Customer 4: customers who spend medium amounts (between 25,000 and 30,000 m.u.) on Milk, Grocery, and Detergents_Paper products
  • Persona or Customer 5: customers who spend large amounts (more than 30,000 m.u.) on fresh-type products (Fresh and Frozen)

From the above analysis, the optimal number of clusters is 3, since with 5 clusters some segments largely duplicate others and can be merged.

5. Summary & Findings

Based on the analysis performed, the Personas or customer segments of the wholesale service, based on customer spending habits, should be:

  • Persona or Customer 1: customers who spend small amounts (less than 25,000 m.u.) across all product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen).
  • Persona or Customer 2: customers who spend large amounts (more than 30,000 m.u.) on fresh-type products (Fresh and Frozen).
  • Persona or Customer 3: customers who spend large amounts (more than 30,000 m.u.) on Milk, Grocery, and Detergents_Paper products.

Customer Segmentation is an important marketing strategy that organizations should deploy for their products and services. It involves "dividing a large, heterogeneous market into smaller segments of consumers with distinct needs, characteristics, or behaviors that might require separate strategies."

Based on this segmentation, we can build the Personas. A Persona is a consumer group or cluster that shares similar values, behaviors, and goals. Personas begin with these basic profiles and are then given names, faces, and personalities. Personas help us understand the emotional and behavioral triggers behind individual consumers.

Understanding this point is key to delivering a breakthrough customer experience, and K-means clustering is a simple but powerful machine learning algorithm that gets us there.