Preface

Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. One advantage of customer segmentation is the ability to develop customized marketing campaigns and an optimal distribution strategy, which will be our focus in this section.

In this section, we are going to look for the optimal number of customer groups based on coupon-discount usage, using behavioral data such as spending, transaction counts, and discount usage. The most common ways businesses segment their customer base are behavioral data (transactions), demographic information, geographical information, and psychographics. We are going to segment the customers by their behavior using k-means.

A common cluster analysis technique is the algorithm known as ‘k-means cluster analysis’. Since the process is not based on predetermined rules, the data reveals customer archetypes and trends that exist inherently within the customer population. Customer clustering can thus be used to target customers with offers and incentives personalized to their needs and wants, making it an important tool for any modern business.

Objective: Customer segmentation based on discount-usage behavior

Data source:

Kaggle

Library Preparation

library(tidyverse)
library(ggplot2)
library(cowplot)
library(plotly)
library(GGally)
library(FactoMineR)
library(factoextra)

Read Data & Data Descriptions

Mind Map

The data source linked above provides several CSV files containing different information that we can use depending on our analysis objective. For this section, our main focus is analyzing the Customer Transaction Data.

knitr::include_graphics("Kaggle DataSet/coupon/Schema.png")

Read Data

As mentioned, the main data set we are going to use is the Customer Transaction Data file. Let’s read customer_transaction_data.csv and put it into a new variable called transaction.

transaction <- read.csv("Kaggle DataSet/coupon/customer_transaction_data.csv")
transaction

Data Descriptions: Transaction Data Frame

  1. date : Date of transaction
  2. customer_id : Unique customer ID
  3. item_id : Unique item ID
  4. quantity : Quantity of items bought
  5. selling_price : Sales value of the transaction
  6. other_discount : Discount from other sources (such as a manufacturer coupon/loyalty card)
  7. coupon_discount : Discount availed from retailer coupon

Data Wrangling

In this section, we are going to inspect what adjustments are needed to clean our data.

Now let’s check our data:

  1. View data frame
transaction

Notice that the transaction data frame records the discounts used by customers in the other_discount & coupon_discount columns. We are going to identify and count how many times a customer uses each type of discount per transaction.

  2. Check data frame summary
summary(transaction)
#>      date            customer_id      item_id         quantity      
#>  Length:1324566     Min.   :   1   Min.   :    1   Min.   :    1.0  
#>  Class :character   1st Qu.: 418   1st Qu.:14684   1st Qu.:    1.0  
#>  Mode  :character   Median : 801   Median :26597   Median :    1.0  
#>                     Mean   : 804   Mean   :29519   Mean   :  130.7  
#>                     3rd Qu.:1198   3rd Qu.:42406   3rd Qu.:    1.0  
#>                     Max.   :1582   Max.   :74066   Max.   :89638.0  
#>  selling_price      other_discount     coupon_discount     
#>  Min.   :    0.36   Min.   :-3120.31   Min.   :-1992.2300  
#>  1st Qu.:   49.16   1st Qu.:  -23.15   1st Qu.:    0.0000  
#>  Median :   78.01   Median :   -1.78   Median :    0.0000  
#>  Mean   :  114.60   Mean   :  -17.77   Mean   :   -0.5949  
#>  3rd Qu.:  124.31   3rd Qu.:    0.00   3rd Qu.:    0.0000  
#>  Max.   :17809.64   Max.   :    0.00   Max.   :    0.0000

Highlight point: Since discount values are recorded as negative numbers, the maximum value of the other_discount & coupon_discount columns is 0. This means both columns are consistent and contain no errors. We also see that the value range among variables is wide (-3,120 to 89,638).
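As a quick sanity check (a minimal sketch using the transaction data frame loaded above), we can confirm that neither discount column ever goes above 0:

# Both checks should return FALSE: discounts are stored as values <= 0
any(transaction$other_discount > 0)
any(transaction$coupon_discount > 0)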

  3. Check data frame structure
str(transaction)
#> 'data.frame':    1324566 obs. of  7 variables:
#>  $ date           : chr  "2012-01-02" "2012-01-02" "2012-01-02" "2012-01-02" ...
#>  $ customer_id    : int  1501 1501 1501 1501 1501 1501 857 857 857 67 ...
#>  $ item_id        : int  26830 54253 31962 33647 48199 57397 12424 14930 16657 10537 ...
#>  $ quantity       : int  1 1 1 1 1 1 1 1 1 3 ...
#>  $ selling_price  : num  35.3 53.4 106.5 67.3 71.2 ...
#>  $ other_discount : num  -10.7 -13.9 -14.2 0 -28.1 ...
#>  $ coupon_discount: num  0 0 0 0 0 0 0 0 0 0 ...

Since this data frame stores the customer transaction history, we are going to merge all rows with identical IDs into one transaction summary (each row will identify one customer and their summarized transactions), because we are going to cluster at the customer level.

  4. Check for NA values
anyNA(transaction)
#> [1] FALSE

There are no NA values in our data.

  5. Check for duplicated values
anyDuplicated(transaction)
#> [1] 5466

There are rows flagged as duplicated data. But are they really duplicates, or repeat orders? Let’s recheck:

  • Subset all rows flagged as duplicate data:
transaction[duplicated(transaction), ]
  • Take a sample (customer_id = 814, item_id = 25251):
transaction[transaction$customer_id == 814 & transaction$item_id == 25251,]
  • Take another sample (customer_id = 161, item_id = 28375):
transaction[transaction$customer_id == 161 & transaction$item_id == 28375,]

Having rechecked the transaction dates, these are not duplicated data but repeat orders. We can keep all the data.
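One optional way to corroborate this (a sketch using dplyr’s count(), loaded via tidyverse) is to count rows per date, customer, and item; counts above 1 indicate same-day repeat purchases of the same item rather than data-entry duplicates:

# Rows sharing date, customer_id, and item_id more than once are
# same-day repeat orders of the same item
transaction %>% 
  count(date, customer_id, item_id, sort = TRUE) %>% 
  filter(n > 1)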

After this inspection, we can conclude that the following adjustments are needed for our data:

Data Adjustment:

  1. Drop the date column (in this section we are not going to analyze by transaction date)
  2. Change the customer_id column into a factor (categorical data, as we are going to merge all transactions by ID)
  3. Add a new column identifying whether a customer used a discount coupon in a transaction
  4. Add a new column identifying whether a discount coupon was available for an item (checked against the coupon data frame) when the transaction was made
  5. Add a new column identifying whether a customer did not use a coupon even though a discount coupon was available for the item
  6. Add a new column identifying whether a customer used another discount

Feature Engineering

  1. Add a column to identify whether a customer used a discount coupon

To identify whether a customer used a discount coupon in a certain transaction, we can look at the coupon_discount value. If coupon_discount is not 0 (i.e., negative), the customer used a discount coupon.

Therefore we can use the ifelse() function, as there are only 2 conditions:

New column name: used_coupon

Condition :

  • If coupon_discount is NOT 0: the customer used a discount coupon in the transaction (TRUE, encoded as 1)
  • Otherwise (coupon_discount is 0): FALSE (encoded as 0)
transaction$used_coupon <- ifelse(transaction$coupon_discount != 0 , 1, 0 )
transaction[transaction$used_coupon == 1, ]
  2. Add a column to identify whether a discount coupon is available for an item

The information on discounted item IDs is stored in coupon_item_mapping.csv. Let’s read the data and store the data frame in a coupon variable:

coupon <- read.csv("Kaggle DataSet/coupon/coupon_item_mapping.csv")
coupon

Notice that the coupon and transaction data frames share the item_id column, so we can check whether a discount coupon is available for an item in the transaction data frame by cross-checking the two data frames.

New column name: coupon_available

Condition :

  • If an item_id in the transaction data frame also appears in the coupon data frame: the item is discounted / a discount coupon is available (TRUE, encoded as 1)
  • Otherwise (the item_id in the transaction data frame is NOT in the coupon data frame): FALSE (encoded as 0)
transaction$coupon_available <- ifelse(transaction$item_id %in% coupon$item_id, 1 , 0)
transaction[transaction$coupon_discount < 0,]

Now, as we can see above, we have the following information:

when coupon_discount is negative (the customer used a coupon discount),

then used_coupon is 1 (flagged as TRUE),

because coupon_available is 1 (the item matches the discount item list in the coupon data frame).

Yet there are some inconsistent values in the coupon_available column, where coupon_discount is negative but the item_id is not on the list in the coupon data frame (0). However, it is more logical to say that the customer did indeed use a coupon (which is why coupon_discount is negative) than to assume the customer did not use a discount just because the item_id is absent from the coupon data frame. This may be caused by human error (in the data input process), or perhaps the coupon data frame only lists discounted items for a certain period of time. Because this issue is not described in the data source, let’s move on with our data processing and assume the customer did use a coupon discount.

Conclusion: replace coupon_available with 1 wherever the condition matches (coupon_discount is negative).

# Recheck all rows that match to the condition
transaction[transaction$coupon_discount != 0.00 & transaction$coupon_available == 0,]
# Replace inconsistent coupon_available column into 1
transaction[transaction$coupon_discount != 0.00 & transaction$coupon_available == 0, "coupon_available"] <- 1

# Recheck the row
transaction[c(113,114, 130, 152, 154),]
  3. Add a new column to identify whether a customer did not use a discount coupon even though one was available

Now we’re going to create a new column to identify customers who did not use the discount coupon even though it was available.

New column name: notused_coupon_available

Condition :

  • If coupon_available is 1 (TRUE/available) and coupon_discount is 0: the customer did not use the discount coupon (TRUE, encoded as 1)
  • Otherwise: FALSE (encoded as 0)
transaction$notused_coupon_available <- ifelse(transaction$coupon_available == 1 & transaction$coupon_discount == 0.00, 1, 0 )

transaction
  4. Add a new column to identify whether a customer used a discount other than a coupon

We’re going to create a new column identifying whether a customer used a discount other than a coupon, so we can check how frequently customers use other discounts rather than coupon discounts. This may help us understand our customers’ behavior better.

New column name: used_other_discount

Condition :

  • If other_discount is negative: the customer used another discount (TRUE, encoded as 1)
  • Otherwise (other_discount is 0): FALSE (encoded as 0)
transaction$used_other_discount <- ifelse(transaction$other_discount < 0 , 1, 0)
transaction[, c("other_discount","used_other_discount" )]

Data Preprocessing

Now we’d like to know how frequently each customer uses discount coupons. To get this information, more preparation needs to be applied to our data frame.

transaction

As mentioned in the previous subchapter, we’re going to merge all transactions by customer_id to summarize each customer’s transactions and coupon/discount usage, so we can measure how often a customer uses a discount coupon.

Therefore in this process we’re going to :

  • Drop date and item_id, as those columns will not be included in further processing
  • Remove the minus sign from the discount columns (so we can sum the discount amounts used)
  • Change the customer_id data type to factor (so we can sum all transactions with a matching customer_id)
  • Create a new column counting how many transactions each customer made (count_trx); because each row identifies one transaction, we set count_trx to 1 (it is then summed per unique ID, giving the number of transactions per customer)

The adjusted data will be saved in a new variable called trx.sum (transaction summary).

trx.sum <- transaction %>% 
  select(-c(date, item_id)) %>% 
  mutate(count_trx = 1, 
         other_discount = str_remove(other_discount, "-"),
         coupon_discount = str_remove(coupon_discount, "-"),
         customer_id = as.factor(customer_id)) %>%
  mutate_at(vars(other_discount, coupon_discount), as.numeric) %>% 
  group_by(customer_id) %>% 
  summarise_all(sum)

trx.sum
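Note that the character round-trip with str_remove() only flips the sign because the discounts are never positive; abs() achieves the same more directly. A minimal equivalent sketch (trx.sum.alt is just an illustrative name):

# Equivalent, more direct alternative using abs() for the sign flip
trx.sum.alt <- transaction %>% 
  select(-c(date, item_id)) %>% 
  mutate(count_trx = 1,
         across(c(other_discount, coupon_discount), abs),
         customer_id = as.factor(customer_id)) %>% 
  group_by(customer_id) %>% 
  summarise_all(sum)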

From the trx.sum data frame above, we can now see customer transaction and discount-usage behavior more clearly.

Here is how to interpret the trx.sum data frame:

  1. Total items purchased by a customer (quantity)
  2. Total customer spending, i.e., item prices before discount (selling_price)
  3. Total amount of other discounts used by a customer (other_discount)
  4. Total amount of coupon discounts used by a customer (coupon_discount)
  5. Number of transactions in which a customer used a discount coupon (used_coupon)
  6. Number of a customer’s transactions in which a discount coupon was available (coupon_available)
  7. Number of transactions in which a customer did not use a coupon although one was available (notused_coupon_available)
  8. Number of transactions in which a customer used another discount (used_other_discount)
  9. Number of transactions made by a customer (count_trx)

For this section, it is more suitable to convert the used_coupon column into a used-coupon proportion, simplifying our data frame while keeping the same information.

Now let’s convert the used_coupon column into a proportion per customer:

Steps :

  • Convert the used_coupon value into the proportion of used_coupon relative to coupon_available for each customer (expressed as a percentage, rounded to 2 decimal places)
  • Drop the coupon_available and notused_coupon_available columns
trx.sum <- trx.sum %>% 
  mutate(used_coupon = round((used_coupon/coupon_available)*100, 2)) %>% 
  select(-c(coupon_available, notused_coupon_available))

trx.sum

EDA

EDA (Exploratory Data Analysis) is the process of performing initial investigations on data to better understand it, discover patterns, spot anomalies, and check assumptions. Let’s see what information we can get from this process.

Data Distribution

  1. Discount Coupon Used Frequency

To get a glimpse of customer behavior around discount coupons, let’s look at the data distribution in the transaction data frame.

Note: trx.sum merged all transactions per customer ID, so for per-transaction frequencies we use the raw transaction data frame.

table(transaction$used_coupon) %>% 
  barplot(names = c("Not Used Coupon", "Used Coupon"), 
          main = "Coupon Discount Used Frequency",
          col = c("#b33126", "#4fc95d"))

Across the whole transaction history, the frequency of customers using a discount coupon is very low.

  2. Discount Coupon Availability per Item

Now let’s check how many items have a discount coupon available:

Let’s count the unique item_id values in the coupon data frame and add the unique item_id values from the transaction data frame that are not in the coupon data frame yet have transactions with a negative coupon_discount.

Note: we do not count only from the coupon_available column in the transaction data frame; doing so would miss item_ids that are actually discounted but were never purchased by any customer.

#count unique discount item_id in coupon data frame
length(unique(coupon$item_id)) + #sum
  
  # Count unique discount item_id that are not in the coupon data frame, and coupon available is TRUE 
  transaction[!transaction$item_id %in% coupon$item_id & transaction$coupon_available == 1, "item_id" ] %>% 
  unique() %>% 
  length() 
#> [1] 38737

As the calculation above shows, there are 38,737 items with a discount coupon.
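For context (a quick optional check; output not shown here), we can compare this against the total number of distinct items in the transaction history:

# Total distinct items ever transacted, to compare with the
# 38,737 discounted items counted above
length(unique(transaction$item_id))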

Now let’s check the frequency of other-discount usage:

table(transaction$used_other_discount) %>% 
  barplot(names = c("Not used other discount", "Used other discount"),
          main = "Other Discount Used Frequency",
          col = c("#b33126", "#4fc95d"))

We can conclude that, counted per transaction, the frequency of customers using a discount coupon is very low, while the frequency of using other discounts is slightly higher than not using them. We can assume that, in general, customers tend to use other discounts more often than coupon discounts.
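To put numbers on that tendency (a quick check on the same indicator columns; output not shown here), we can express both frequencies as proportions of all transactions:

# Share of transactions with a coupon discount vs. any other discount
prop.table(table(transaction$used_coupon))
prop.table(table(transaction$used_other_discount))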

Let’s inspect the distributions of the other variables:

par(mfrow = c(2,2))

hist(trx.sum$quantity, labels = T, xlab = NULL, main = "Total Item Purchased Distribution",col = "#49abd0")
abline(v = median(trx.sum$quantity), col = "red")


hist(trx.sum$other_discount, labels = T, xlab = NULL, main = "Total Other Discount Distribution", col = "#bebce3")
abline(v = median(trx.sum$other_discount), col = "red")


hist(trx.sum$coupon_discount, labels = T, xlab = NULL, main = "Total Coupon Discount Distribution", col = "#bebce3")
abline(v = median(trx.sum$coupon_discount), col = "red")


hist(trx.sum$used_coupon, labels = T, xlab = NULL, main = "Discount Coupon Used Percentage Distribution", col = "#49abd0")
abline(v = median(trx.sum$used_coupon), col = "red")

1. Total Item Purchased Distribution Insight

Right-skewed distribution:

  • One customer purchased more than 2,500,000 items in total (an extreme outlier)
  • The majority of customers purchased fewer than 500,000 items in total
  • The average total items purchased is 18,768
  • Some customers have a total purchased quantity of 0

2. Total Other Discount Distribution Insight

Right-skewed distribution:

  • One customer used other discounts totaling more than 100,000 (an extreme outlier)
  • The majority of customers used other discounts totaling in the range 0–20,000
  • The average total other discount used is 11,628
  • Some customers used no other discounts at all (0)

3. Total Coupon Discount Distribution Insight

Right-skewed distribution:

  • The highest total coupon discount used is 12,000 (an extreme outlier)
  • The lowest total coupon discount used is 0
  • The average total coupon discount used is 145.86
  • The majority of customers used coupon discounts totaling in the range 0–2,000

4. Discount Coupon Used Percentage Distribution Insight

Right-skewed distribution:

  • The highest percentage of coupon usage for any customer is 50% (an extreme outlier)
  • The lowest percentage of coupon usage is 0%
  • The average rate of coupon usage per transaction is 0.84%
  • The majority of customers used coupons in 0–10% of their coupon-available transactions

Customer Transaction Frequency Distribution:

hist(trx.sum$count_trx, main = "Customer Total Transaction Frequency", xlab = "Transaction Count", col = "orange", labels = T)
abline(v = median(trx.sum$count_trx), col = "red")

The majority of customers have between 0 and 2,000 transactions in total.

The average number of transactions per customer is approximately 664.

Now let’s check our variable correlation:

ggcorr(trx.sum, label = T, hjust = 1, layout.exp = 2)

There is high multicollinearity (correlation between variables).

This is expected, as some of the new variables were derived from the same columns of the initial data frame.
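For example (a quick check; output not shown here), used_other_discount and count_trx are both sums over the same per-transaction rows, so their correlation should be very high:

# Correlation between two of the derived variables; expected to be
# close to 1 because both are driven by each customer's transaction volume
cor(trx.sum$used_other_discount, trx.sum$count_trx)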

PCA Object

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a data set with many variables.

As you’ve noticed, our trx.sum data frame has 7 numeric variables (plus the customer_id identifier). If we visualized our customer clusters across all of them, the result would be complicated (imagine each variable as one axis of a plot), since we consider every variable to contribute to defining a cluster. To handle this, we can use PCA to reduce the dimensionality without losing much of the data’s information. Each principal component (PC) summarizes part of the variables’ variance, taking their correlations into account. This reduces the dimensions so we can visualize the data in a 2-dimensional plot (2 axes).

Let’s create a PCA object:

Steps

  • Create a PCA object with the PCA() function

  • parameter: scale.unit set to TRUE

  • Note that our variables have different scales (e.g., the selling_price column is unbounded, while the used_coupon column can only reach 100)

Move customer_id into the row names, as it will not be involved in the clustering process (it is only the identifier); each row then represents one customer’s transaction summary.

# Set as data frame
trx.sum <- as.data.frame(trx.sum)


# Optional (personal preference): add the prefix "Id-" to customer_id for clearer labels
trx.sum <- trx.sum %>% 
  mutate(customer_id = paste0("Id-", customer_id))


# Set customer_id as row names (customer_id is the identifier)
rownames(trx.sum) <- trx.sum$customer_id

# Drop the customer_id column
trx.sum <- trx.sum %>% 
  select(-customer_id)

trx.sum
  • parameter: quali.sup would take the column indexes of any categorical/factor variables; since customer_id has been moved into the row names, we do not need it here
  • parameter: graph set to FALSE (we do not want the graph drawn immediately after creating the PCA object; set it to TRUE if you do)
  • parameter: ncp is left at its default. ncp sets the number of dimensions to keep; since we do not specify it, our data is reduced to 5 dimensions (5 is the default for ncp in the PCA() function).
#Create PCA object
trx.pca <- PCA(
  X = trx.sum, 
  scale.unit = T, 
  graph = F
  )

Now our data’s dimensionality has been reduced (summarized). Let’s check the PCA object.

trx.pca$var$coord
#>                         Dim.1       Dim.2       Dim.3       Dim.4       Dim.5
#> quantity            0.6195792 -0.08112585  0.77501704 -0.03739131  0.07066444
#> selling_price       0.9326342 -0.22322195  0.08803342  0.08028152 -0.18967384
#> other_discount      0.9259341 -0.07321597 -0.20712055 -0.07474909  0.26435240
#> coupon_discount     0.6049229  0.72486811 -0.05684311 -0.30211298 -0.11789303
#> used_coupon         0.3008681  0.90824637  0.03714291  0.28358743  0.05227564
#> used_other_discount 0.9428689 -0.16422719 -0.20151272  0.01554260  0.06805377
#> count_trx           0.9323935 -0.24739076 -0.16870217  0.10755580 -0.12895465

As expected, our data now has 5 dimensions (summarized from the 7 numeric variables of our trx.sum data frame). From the dimension description above, we can see the variables that contribute most strongly to the first PC (Dim.1):

  • selling_price (0.9326342), other_discount (0.9259341), used_other_discount (0.9428689), count_trx (0.9323935)

Now, you may wonder how these numbers describe our data. Let’s visualize the data distribution after reshaping it into the PCA space:

# Plotting preparation

#Create individual PCA Plot
plot.ind <- plot.PCA(x = trx.pca,
         choix = "ind",
         select = "contrib 10" # Show the 10 highest-contributing individuals (extreme outliers)
         )

#Create variable PCA Plot
plot.var <- plot.PCA(trx.pca, choix = "varcor")


#Merge Plot
plot_grid(plot.ind, plot.var, labels = c("A. Individual Plotting", "B. Variable Plotting"))

As we can see from the plots above, our data can now be displayed on only 2 axes (Dim 1 as the x-axis and Dim 2 as the y-axis). Notice that both plots have zero lines on the x- and y-axes; these help us identify which variables contribute to each PC, and where each data point falls. The direction an arrow points shows which PC the variable contributes to most: if an arrow points along the x-axis (parallel to the Dim 1 axis), that variable contributes mainly to PC 1.

Let’s take the point Id-1555 as an example. Why is Id-1555 at that particular position? Look at plot B and note where the used_other_discount arrow points; from that position, go back to plot A and find Id-1555. Its position lies approximately between the used_other_discount and selling_price arrows. This means that:

  1. Id-1555 is described mostly by Dim 1 (PC 1); remember that used_other_discount contributes significantly to PC 1
  2. Id-1555 is considered an outlier on the variables pointing in that direction (its used_other_discount and selling_price values are very high relative to the majority of the population). Let’s compare it with another ID, say Id-6, which is not considered an extreme outlier, especially on the used_other_discount variable:
#Subset used_other_discount Id-1555 & Id-6 to check 
trx.sum[c("Id-1555", "Id-6"), "used_other_discount"]
#> [1] 2629  234
#Subset selling_price Id-1555 & Id-6 to check 
trx.sum[c("Id-1555", "Id-6"), "selling_price"]
#> [1] 518142.14  59508.22
  • Compared to Id-6 (234 transactions with other discounts), Id-1555’s value (2,629) is far higher, marking it as an outlier
  • Compared to Id-6 (total selling price 59,508.22), Id-1555’s value (518,142.14) is far higher, again marking it as an outlier
  3. Id-1555’s position in plot A (below 0 on Dim 2 and above 5 on Dim 1) is calculated by applying the trx.pca$var$coord loadings to Id-1555’s (scaled) values of each original variable in the data frame before it was reshaped into the PCA object.
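We can also read Id-1555’s plotted coordinates directly from the PCA object (a quick check; $ind$coord stores the individuals’ coordinates just as $var$coord stores the variables’):

# Id-1555's coordinates on the first two PCs, the same values used
# to place its point in plot A
trx.pca$ind$coord["Id-1555", 1:2]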

For a clearer view of which variables contribute significantly to each PC, see the plots below.

Note: we focus on PC 1 & PC 2, as these two PCs describe most of our data’s information, and we are going to create a 2D plot to summarize the data.

  1. Variable contributions to PC 1
fviz_contrib(X = trx.pca,choice = "var",axes = 1, fill = "#a1d2d7")

Although all 7 variables contribute to PC 1, those that contribute significantly are used_other_discount, count_trx, selling_price, and other_discount (the bars exceeding the red line).

  2. Variable contributions to PC 2
fviz_contrib(X = trx.pca,choice = "var",axes = 2, fill = "#c8c7ef")

Although all 7 variables contribute to PC 2, those that contribute significantly are used_coupon & coupon_discount (the bars exceeding the red line).

Now let’s get back to our PCA object:

trx.pca$eig
#>        eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.32585160             61.7978800                          61.79788
#> comp 2 1.50028802             21.4326860                          83.23057
#> comp 3 0.72497875             10.3568393                          93.58741
#> comp 4 0.19693457              2.8133509                          96.40076
#> comp 5 0.14874395              2.1249135                          98.52567
#> comp 6 0.08843461              1.2633515                          99.78902
#> comp 7 0.01476851              0.2109787                         100.00000

This table summarizes how the variance accumulates. Remember that each PC is formed as a summary of our variables; trx.pca$eig accumulates the variance across the combined PCs. Notice that we have 7 variables in total: using all 7 PCs would retain 100% of the information. But when we created our PCA we kept only 5 PCs, meaning we retain 98.5% of the data’s information, which is quite high for a cumulative percentage of variance (remember, PCs summarize the variable information). How much information to keep is ultimately defined by our needs, or perhaps a user’s request.
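We can read that retained share straight from the eigenvalue table by summing the variance percentages of the 5 components we kept:

# Variance retained by the 5 PCs kept (ncp = 5, the PCA() default)
sum(trx.pca$eig[1:5, "percentage of variance"])
#> [1] 98.52567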

The PCA object can also help us identify outliers in our data frame. Let’s check the individual plot, which shows the 10 most extreme outliers:

And the extreme outliers are:

IDs: Id-22, Id-70, Id-1333, Id-367, Id-1574, Id-748, Id-1558, Id-464, Id-1555, Id-1011

plot.ind

These extreme outliers will be excluded from our clustering process (to avoid biasing the clusters).

K-Mean Clustering

K-means groups similar data points together to discover underlying patterns. A cluster is a collection of data points aggregated together because of certain similarities. We define a target number k, which is the number of centroids we want in the data set; a centroid is the imaginary or real location representing the center of a cluster.

To prepare for k-means clustering, let’s remove the extreme outliers that appeared in the PCA individual plot above:

extreme.outliers <- c("Id-22", "Id-70", "Id-1333", "Id-367", "Id-1574", "Id-748", "Id-1558", "Id-464", "Id-1555", "Id-1011")
trx.clean <- trx.sum[!rownames(trx.sum) %in% extreme.outliers, ]

After cleaning the 10 most extreme outliers from our data, we can determine the optimal number of clusters k. The value of k can be set according to business needs, or, as in our case, determined statistically using the elbow method.

Elbow Method

In cluster analysis, the elbow method is a heuristic for determining the number of clusters in a data set. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. As our objective is to group our customers by similar characteristics, the elbow method can help us determine the optimal number of clusters.

We can apply the elbow method with the fviz_nbclust() function.

fviz_nbclust(
  x = scale(trx.clean),
  FUNcluster = kmeans,
  method = "wss"
)

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e., the point after which the distortion/inertia starts decreasing in a linear fashion. For our data, we can conclude that the optimal number of clusters is 4; notice that after k = 4 the decrease becomes roughly linear.
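Under the hood, fviz_nbclust(method = "wss") simply plots the total within-cluster sum of squares for each candidate k. A minimal manual sketch of the same curve (the seed and the 1–10 range are free choices here):

# Manual elbow curve: total WSS for k = 1..10
set.seed(78)
wss <- sapply(1:10, function(k) {
  kmeans(scale(trx.clean), centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")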

Creating k-means object

Having decided the number of clusters k for our k-means model, let’s create the model and inspect the clustering result:

#Create k-means object
RNGkind(sample.kind = "Rounding")
set.seed(78)
trx.cluster <- kmeans(scale(trx.clean), 4)


#Check groups size
trx.cluster$size
#> [1] 451 124 920  77

Having segmented our customers with the kmeans() function, we now have 4 customer types based on transaction behavior. From trx.cluster$size we get the number of customers in each group:

  1. Cluster 1 : 451 customers
  2. Cluster 2 : 124 customers
  3. Cluster 3 : 920 customers
  4. Cluster 4 : 77 customers

Before we inspect each cluster’s characteristics, let’s look at the model’s performance.

  1. As mentioned above, k-means defines a centroid as the (possibly imaginary) location representing the center of a cluster. To place the centroids, kmeans starts from random positions and then searches for the most suitable centroids for our data. $iter tells us how many iterations were taken until it found the best clustering for our data; in our case, 4:
trx.cluster$iter
#> [1] 4
  2. Within Sum of Squares (WSS)

The WSS value measures the (squared) distance of each observation to its centroid. The WSS values of our clusters do not differ much; still, cluster 3 has the lowest WSS, indicating its observations are more similar to each other than in the other 3 clusters.

trx.cluster$withinss
#> [1] 1048.1855 1133.0947  822.6827 1096.5568
  3. Between Sum of Squares (BSS)

The BSS value is the weighted sum of squared distances from each centroid to the global average. Our BSS value is considered high, meaning the clusters differ widely in their characteristics; that is good, since each cluster should be distinct from the others.

trx.cluster$betweenss
#> [1] 6896.48

  4. BSS / TOTSS (Total Sum of Squares)

The BSS / TOTSS ratio shows how well each cluster captures the dominant characteristics of its members and how well the clustering describes our data’s distribution: the closer the value is to 1, the better the model. Our BSS/TOTSS value is about 0.63, which is quite good.

trx.cluster$betweenss/
trx.cluster$totss
#> [1] 0.6271238

Cluster Characteristic Inspection

Having inspected the model, we can now proceed to examine the clustering itself. In this section we look at each cluster’s characteristics.

Steps

  1. Put the cluster labels into our trx.clean data frame
  2. Group the data by cluster and calculate the average of each variable
  3. Inspect the results
# Put the clustering result into the trx.clean data frame
trx.clean$clustering <- as.factor(trx.cluster$cluster)

# See an example of the customer clustering result
trx.clean[1:6, 6:8]
# Group by cluster and compute the average of each variable
profile.average <- trx.clean %>% 
  group_by(clustering) %>% 
  summarise_all(mean) %>% 
  select(clustering, used_coupon, coupon_discount, used_other_discount, count_trx, everything())

profile.average

Clustering Profile

1. Cluster 1

  • In general, cluster 1 ranks 2nd lowest in every aspect except transaction count (count_trx), where it ranks 2nd highest; we can assume cluster 1 is active in transacting
  • Cluster 1’s coupon-usage percentage differs little from cluster 2’s (about 0.1 percentage points), but its total coupon discount amount is much lower. We can assume cluster 1 will use a discount coupon even when the discount value is low.

Cluster 1 Profile Highlight: high transaction rate; likely to use a discount even when the discount value is low.

2. Cluster 2

  • Highest average use of other discounts
  • Although the average coupon-usage percentage is low (1.95%), the average coupon discount amount is twice that of cluster 1 (even though the usage percentages differ by only about 0.1 points)
  • Most frequent transactions
  • Highest total items purchased (quantity) and spending (selling_price)

Cluster 2 Profile Highlight: highest in transaction count, quantity, and selling_price; prefers other discounts; likely tends to use a coupon discount when the offered discount value is high.

3. Cluster 3

  • In general, cluster 3 ranks lowest in every aspect. We can assume this cluster needs more engagement. Let’s check:
trx.clean %>% 
  filter(coupon_discount == 0) %>% 
  select(clustering) %>% 
  group_by(clustering) %>% 
  table()
#> clustering
#>   1   2   3   4 
#>  37   3 292   0
  • As the calculation above shows, when we subset on coupon_discount == 0, there are 292 customers in cluster 3 who never used a coupon even when one was available. In other words, cluster 3 describes the group of customers who never, or only rarely, use a discount coupon.

Cluster 3 Profile Highlight: lowest rank in every aspect; coupon usage ranges from never to rarely.

4. Cluster 4

  • Most frequent use of discount coupons
  • A tendency to purchase more than one item at low prices (count_trx & selling_price are close to cluster 1, but quantity is twice as high)
  • Uses both coupons and other discounts
  • No customer in cluster 4 has never used a discount coupon

Cluster 4 Profile Highlight: high transaction rate; the most frequent coupon usage; interested in both coupon and other discounts; transaction behavior: purchases more than 1 item per transaction.

We can also read the same information from the visualization below:

scale.profile <- profile.average %>% 
  mutate_if(is.numeric, scale)

scale.profile %>% 
  pivot_longer(-clustering) %>% 
  ggplot(aes(x = clustering, y = value, fill = clustering)) +
  geom_col() +
  facet_wrap(~name)

Summary

freq.clus <- trx.clean %>% 
  select(clustering) %>% 
  table() %>% as.data.frame() 


ggplot(freq.clus, aes(x= clustering, y = Freq)) +
  geom_col(fill = c("#5c88bd", "#bfe0c9", "#f8998f", "#bdaed1"))+
  geom_text(aes(label = Freq, group = clustering), hjust = .5, vjust = -.5, fontface = "bold") +
  theme_minimal() +
  labs(x = "Cluster",
       y = "Frequency",
       title = "Cluster Frequency Distribution") +
  theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
        axis.title.y = element_text(size = 13, hjust = 0),
        axis.text = element_text(size = 12))

From the analysis and the plot above, we can conclude:

  1. The optimal number of clusters to describe our customers is 4
  2. In general, our customers tend to use other discounts rather than coupon discounts
  3. In general, our customers are not interested in coupon discounts (3 out of 4 clusters use coupon discounts in less than 2% of their transactions)
  4. The most common customer type is cluster 3 (lowest in every aspect)
  5. The least common customer type is cluster 4 (the most frequent coupon usage)

Conclusion

Using coupon discounts to increase our customers’ transaction rate is inefficient. If a coupon discount campaign is nevertheless needed, we can tailor the discount offer strategy to each cluster.