Customer Segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. One of the advantages of customer segmentation is the ability to develop customized marketing campaigns and an optimal distribution strategy, which will be our focus in this section.
In this section we're going to look for the optimal number of customer groups based on behavioral data around coupon usage, such as spending, transaction count, and discount usage. The most common ways for businesses to segment their customer base are behavioral data (transactions), demographic information, geographic information, and psychographics. We're going to segment customers by their behavior using k-means.
A common cluster analysis process is an algorithm known as 'k-means cluster analysis'. Since the process is not based on predetermined rules, the data reveals customer archetypes and trends that exist inherently within customer populations. Customer clustering and analysis can thus be used to effectively target customers with offers and incentives personalized to their needs and wants. It is an important tool for any business operating in the modern day.
Objective: Customer segmentation based on discount usage behaviour
Data source:
library(tidyverse)
library(ggplot2)
library(cowplot)
library(plotly)
library(GGally)
library(FactoMineR)
library(factoextra)

From the data source link above, we get several files (csv) that contain different information we can use for our analysis objective. For this section our main focus is to analyze the Customer Transaction Data.
knitr::include_graphics("Kaggle DataSet/coupon/Schema.png")

As I've mentioned, the main data set we're going to use is the Customer Transaction Data file. Let's read customer_transaction_data.csv and put it into a new variable called transaction.
transaction <- read.csv("Kaggle DataSet/coupon/customer_transaction_data.csv")
transaction

Data Descriptions: Transaction Data Frame

- date : Date of transaction
- customer_id : Unique customer ID
- item_id : Unique item ID
- quantity : Quantity of item bought
- selling_price : Sales value of the transaction
- other_discount : Discount from other sources (such as manufacturer coupon/loyalty card)
- coupon_discount : Discount availed from retailer coupon

In this section we're going to inspect what adjustments are needed to clean our data.
Now let's check our data:
transaction

Notice that from the transaction data frame, we get the information on discounts used by customers in the other_discount & coupon_discount columns. We're going to identify and count how many times a customer used each type of discount per transaction.
summary(transaction)

#> date customer_id item_id quantity
#> Length:1324566 Min. : 1 Min. : 1 Min. : 1.0
#> Class :character 1st Qu.: 418 1st Qu.:14684 1st Qu.: 1.0
#> Mode :character Median : 801 Median :26597 Median : 1.0
#> Mean : 804 Mean :29519 Mean : 130.7
#> 3rd Qu.:1198 3rd Qu.:42406 3rd Qu.: 1.0
#> Max. :1582 Max. :74066 Max. :89638.0
#> selling_price other_discount coupon_discount
#> Min. : 0.36 Min. :-3120.31 Min. :-1992.2300
#> 1st Qu.: 49.16 1st Qu.: -23.15 1st Qu.: 0.0000
#> Median : 78.01 Median : -1.78 Median : 0.0000
#> Mean : 114.60 Mean : -17.77 Mean : -0.5949
#> 3rd Qu.: 124.31 3rd Qu.: 0.00 3rd Qu.: 0.0000
#> Max. :17809.64 Max. : 0.00 Max. : 0.0000
Highlight Point: As discount values are recorded as negative numbers, the maximum value of the other_discount & coupon_discount columns is 0. This means both columns have consistent values and no errors. We also see that the value range among variables is wide (from about -3,120 to 89,638).
str(transaction)

#> 'data.frame': 1324566 obs. of 7 variables:
#> $ date : chr "2012-01-02" "2012-01-02" "2012-01-02" "2012-01-02" ...
#> $ customer_id : int 1501 1501 1501 1501 1501 1501 857 857 857 67 ...
#> $ item_id : int 26830 54253 31962 33647 48199 57397 12424 14930 16657 10537 ...
#> $ quantity : int 1 1 1 1 1 1 1 1 1 3 ...
#> $ selling_price : num 35.3 53.4 106.5 67.3 71.2 ...
#> $ other_discount : num -10.7 -13.9 -14.2 0 -28.1 ...
#> $ coupon_discount: num 0 0 0 0 0 0 0 0 0 0 ...
As this data frame stores the customer transaction history, we're going to merge all rows with an identical customer ID into one summary row (each row will identify one customer and their transaction summary), since we're going to cluster customers rather than individual transactions.
anyNA(transaction)

#> [1] FALSE

There is no NA value in our data.
anyDuplicated(transaction)

#> [1] 5466

There are rows flagged as duplicated data (anyDuplicated() returns the index of the first duplicated row, here row 5466). But are they really duplicates or repeat orders? Let's recheck:
transaction[duplicated(transaction), ]

Let's recheck one of the flagged combinations (customer_id = 814, item_id = 25251):

transaction[transaction$customer_id == 814 & transaction$item_id == 25251, ]

And another (customer_id = 161, item_id = 28375):

transaction[transaction$customer_id == 161 & transaction$item_id == 28375, ]

As we've rechecked the dates of these transactions, they are not duplicated data but repeat orders. We can keep all the data.
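If we also want the total number of fully duplicated rows (anyDuplicated() only reports the first index), here is a small optional check:

# Count how many rows are exact duplicates of an earlier row
sum(duplicated(transaction))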
After the inspection, we can conclude that these are the adjustments we need to apply to our data:
Data Adjustment:
- Drop the date column (in this section we're not going to analyze based on transaction date)
- Convert the customer_id column into factor (categorical data, as we're going to merge all transactions based on their ID)

To identify whether a customer used a discount coupon in a certain transaction, we can look at the coupon_discount value. If coupon_discount is not 0 (i.e. a negative value), then the customer used a discount coupon.
Therefore we can use the ifelse() function, as there are only 2 conditions:

New column name: used_coupon

Condition:

- coupon_discount is NOT 0: the customer used a discount coupon in the transaction (TRUE, described by 1)
- coupon_discount is 0: the customer did not (FALSE, described by 0)

transaction$used_coupon <- ifelse(transaction$coupon_discount != 0, 1, 0)
transaction[transaction$used_coupon == 1, ]

The information about discounted item IDs is stored in coupon_item_mapping.csv. Let's read the data and put the data frame into a coupon variable:
coupon <- read.csv("Kaggle DataSet/coupon/coupon_item_mapping.csv")
coupon

Notice that the coupon and transaction data frames share the item_id column, therefore we can check whether a discount coupon is available for an item in the transaction data frame by cross-checking between both data frames.
New column name: coupon_available

Condition:

- item_id in the transaction data frame is also in the coupon data frame: the item is a discount item / there's a discount coupon available (TRUE, described by 1)
- item_id in the transaction data frame is NOT in the coupon data frame: FALSE (described by 0)

transaction$coupon_available <- ifelse(transaction$item_id %in% coupon$item_id, 1, 0)
transaction[transaction$coupon_discount < 0, ]

As we can see above, we get the following information: when coupon_discount is negative (a coupon discount was used by the customer), then used_coupon is 1 (TRUE), and coupon_available is 1 as well (cross-checked against the discount item list from the coupon data frame).
And yet there are some inconsistent values in the coupon_available column, where coupon_discount is negative but the item_id is not on the list in the coupon data frame (0). However, it's more logical to say that the customer did use the coupon, which is why coupon_discount is negative, rather than assume the customer did not use a discount just because the item_id is missing from the coupon data frame. This may be caused by human error (in the data input process), or perhaps the coupon data frame only contains the discount item list for a certain period of time. Because this problem is not described in the data source, let's move on with our data processing and assume that the customer did indeed use a coupon discount.

Conclusion: replace coupon_available with 1 for rows matching the condition (coupon_discount is negative).
# Recheck all rows that match the condition
transaction[transaction$coupon_discount != 0.00 & transaction$coupon_available == 0, ]

# Replace inconsistent coupon_available values with 1
transaction[transaction$coupon_discount != 0.00 & transaction$coupon_available == 0, "coupon_available"] <- 1
# Recheck the row
transaction[c(113, 114, 130, 152, 154), ]

Now we're going to create a new column to identify customers who didn't use the discount coupon even though it was available.
New column name: notused_coupon_available
Condition:

- coupon_available is 1 (TRUE/AVAILABLE) and coupon_discount is 0: the customer did not use the available discount coupon (TRUE, described by 1)
- otherwise: FALSE (described by 0)

transaction$notused_coupon_available <- ifelse(transaction$coupon_available == 1 & transaction$coupon_discount == 0.00, 1, 0)
transaction

We're going to create another new column to identify whether a customer used a discount other than a coupon, so we can check how frequently a customer uses other discounts rather than coupon discounts. This may help us understand more about our customers' behavior.
New column name: used_other_discount
Condition:

- other_discount is negative: the customer used another discount (TRUE, described by 1)
- other_discount is 0: FALSE (described by 0)

transaction$used_other_discount <- ifelse(transaction$other_discount < 0, 1, 0)
transaction[, c("other_discount","used_other_discount" )]Now we’d like to know about the frequency of using discount coupon from a certain customer. But to get this information, there are more preparations need to be applied to our data frame.
transaction

As I've mentioned in the previous sub-chapter, we're going to merge all transactions according to their customer_id to summarize each customer's transactions and coupon/discount usage, so we can get the frequency of how often a customer uses a discount coupon.
Therefore in this process we're going to:

- Drop date and item_id, as those columns will not be included in further processing
- Convert the customer_id data type into factor (to sum all transactions with a matching customer_id)
- Create a new column (count_trx); because each row identifies 1 transaction, we can fill the count_trx value with 1 (it will then be summed per unique ID, giving us the number of transactions made by each customer)

These adjustments are going to be saved in a new variable called trx.sum (transaction summary).
trx.sum <- transaction %>%
  select(-c(date, item_id)) %>%                            # drop columns not needed further
  mutate(count_trx = 1,                                    # 1 per row, summed per customer later
         other_discount = str_remove(other_discount, "-"), # strip the minus sign
         coupon_discount = str_remove(coupon_discount, "-"), # so sums are positive totals
         customer_id = as.factor(customer_id)) %>%
  mutate_at(vars(other_discount, coupon_discount), as.numeric) %>%
  group_by(customer_id) %>%
  summarise_all(sum)
trx.sum

From the trx.sum data frame above, we can see customer transaction and discount usage behavior much more clearly. Here is the interpretation we can get from the trx.sum data frame; for each customer we now have:

- total quantity of items purchased (quantity)
- total spending (selling_price)
- total other discount received (other_discount)
- total coupon discount received (coupon_discount)
- number of transactions in which a discount coupon was used (used_coupon)
- number of transactions in which a discount coupon was available (coupon_available)
- number of transactions in which an available coupon was not used (notused_coupon_available)
- number of transactions in which another discount was used (used_other_discount)
- total number of transactions (count_trx)

For this section, it's more suitable to change the used_coupon column into a used-coupon proportion, to simplify our data frame while keeping the same information.
Now let's change the used_coupon column into a proportion for each customer_id's transactions:

Steps:

- Convert the used_coupon value into the proportion of used_coupon to coupon_available for each customer's transactions (as a percentage, rounded to 2 decimal places)
- Drop coupon_available and notused_coupon_available

trx.sum <- trx.sum %>%
  mutate(used_coupon = round((used_coupon/coupon_available)*100, 2)) %>%
  select(-c(coupon_available, notused_coupon_available))
trx.sum

EDA (Exploratory Data Analysis) is the process of performing initial investigations on data so that we have a better understanding of it: discovering patterns, spotting anomalies, and checking assumptions. Let's see what kind of information we can get from this process.
To get a glimpse of our customers' coupon usage behavior, let's look at the data distribution from the transaction data frame.
Note: trx.sum merged all transactions for each customer id, so for transaction-level frequency we use the transaction data frame.
table(transaction$used_coupon) %>%
  barplot(names = c("Not Used Coupon", "Used Coupon"),
          main = "Coupon Discount Used Frequency",
          col = c("#b33126", "#4fc95d"))

Across the whole transaction history, the frequency of customers using a discount coupon is very low.
Now let's check how many discount coupons are available for specific items.

Let's count the unique item_id values from the coupon data frame and add the item_id values from the transaction data frame which are not in the coupon data frame (but whose coupon_discount is negative).

Note: we're not counting only from the transaction data frame's coupon_available column, to avoid missing item_ids that actually have a discount but were never purchased by any customer.
# Count unique discount item_id in the coupon data frame
length(unique(coupon$item_id)) + # sum with:
  # count of unique discount item_id not in the coupon data frame but with coupon_available TRUE
  transaction[!transaction$item_id %in% coupon$item_id & transaction$coupon_available == 1, "item_id"] %>%
  unique() %>%
  length()

#> [1] 38737

As we can see from the calculation above, our data shows that there are 38,737 items that have a discount coupon.
Now let's check the frequency of other discount usage:
table(transaction$used_other_discount) %>%
  barplot(names = c("Not used other discount", "Used other discount"),
          main = "Other Discount Used Frequency",
          col = c("#b33126", "#4fc95d"))

We can conclude that, counted per transaction, the frequency of customers using a discount coupon is very low, while the frequency of using other discounts is slightly higher than not using them. We can assume that, in general, customers tend to use other discounts more frequently than coupon discounts.
Inspect other variable distribution:
par(mfrow = c(2,2))
hist(trx.sum$quantity, labels = T, xlab = NULL, main = "Total Item Purchased Distribution",col = "#49abd0")
abline(v = median(trx.sum$quantity), col = "red")
hist(trx.sum$other_discount, labels = T, xlab = NULL, main = "Total Other Discount Distribution", col = "#bebce3")
abline(v = median(trx.sum$other_discount), col = "red")
hist(trx.sum$coupon_discount, labels = T, xlab = NULL, main = "Total Coupon Discount Distribution", col = "#bebce3")
abline(v = median(trx.sum$coupon_discount), col = "red")
hist(trx.sum$used_coupon, labels = T, xlab = NULL, main = "Discount Coupon Used Percentage Distribution", col = "#49abd0")
abline(v = median(trx.sum$used_coupon), col = "red")
1. Total Item Purchased Distribution Insight
right skewed distribution
2. Total Other Discount Distribution Insight
right skewed distribution
3. Total Coupon Discount Distribution
right skewed distribution
4. Discount Coupon Used Percentage Distribution
right skewed distribution
Customer Transaction Frequency Distribution:
hist(trx.sum$count_trx, main = "Customer Total Transaction Frequency", xlab = "Transaction Count", col = "orange", labels = T)
abline(v = median(trx.sum$count_trx), col = "red")

The majority of customers' total transaction frequencies fall in the range of 0 - 2,000 transactions. The average total transaction frequency per customer is approximately 664 transactions.
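The red line in the histogram marks the median; as a quick optional check of the distribution's center, we can compute both measures directly:

# Center of the transaction-count distribution
median(trx.sum$count_trx)
mean(trx.sum$count_trx)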
Now let’s check our variable correlation:
ggcorr(trx.sum, label = T, hjust = 1, layout.exp = 2)

There is high multicollinearity (correlation between variables). This is expected, as some of the new variables were derived from the initial columns of the data frame.
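To inspect the exact coefficients behind the heatmap, here is a small optional sketch using base R (not required for the workflow):

# Correlation matrix of the numeric columns only
num.cols <- sapply(trx.sum, is.numeric)
round(cor(trx.sum[, num.cols]), 2)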
Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a data set with many variables.
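Before applying it to our data, here is a minimal toy sketch of the idea using base R's prcomp() (purely illustrative; our actual workflow below uses FactoMineR's PCA()): two highly correlated variables collapse into essentially one principal component.

set.seed(1)
toy <- data.frame(x = rnorm(100), y = rnorm(100))
toy$z <- toy$x + rnorm(100, sd = 0.1)  # z is nearly a copy of x
toy.pca <- prcomp(toy, scale. = TRUE)  # scale. = TRUE standardizes variables
summary(toy.pca)                       # proportion of variance per PC
toy.pca$rotation                       # loadings of each variable on each PC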
As you've noticed, our trx.sum data frame has 8 variables; if we visualized our customer clusters directly it would produce a complicated visualization (imagine that 1 variable is defined by 1 axis of a plot), because we consider every variable to contribute to defining a cluster. To handle this issue we can use PCA to apply dimensionality reduction while keeping most of our data's information. A PC in PCA summarizes part of our variables' variance, considering their correlation. This helps us reduce the dimensions so we can visualize the data in a 2-dimensional plot (2 axes).
Let's create a PCA object:

Steps:

- Create a PCA object with the PCA() function
- Parameter: scale.unit set as TRUE. Note that our variables have different scales (for example the selling_price column versus the used_coupon column: selling_price has no upper bound, while used_coupon can only reach 100)
- Change customer_id into the row names, as it will not be involved in the clustering process (it is the identifier), so that each row represents one customer's transaction summary
# Set as data frame
trx.sum <- as.data.frame(trx.sum)

# Optional (personal preference): add prefix Id- to customer_id for clearer information
trx.sum <- trx.sum %>%
  mutate(customer_id = paste0("Id-", customer_id))

# Set customer_id as row names (customer_id is the identifier)
rownames(trx.sum) <- trx.sum$customer_id

# Drop the customer_id column
trx.sum <- trx.sum %>%
  select(-customer_id)
trx.sum

Other PCA() parameters:

- quali.sup: filled with the column index of our categorical/factor data type (not needed here, since customer_id is now stored as row names)
- graph: set as FALSE (we're not going to see the graph immediately after creating the PCA object; if you'd like to, you can set it to TRUE)
- ncp is not included. The ncp parameter sets the number of dimensions we'd like to keep. I'm not going to define a specific number, so our data will be reduced to 5 dimensions (5 is the default ncp in the PCA function).

# Create PCA object
trx.pca <- PCA(
  X = trx.sum,
  scale.unit = T,
  graph = F
)

Now our data's dimensions have been reduced (summarized). Let's check our PCA object.
trx.pca$var$coord

#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
#> quantity 0.6195792 -0.08112585 0.77501704 -0.03739131 0.07066444
#> selling_price 0.9326342 -0.22322195 0.08803342 0.08028152 -0.18967384
#> other_discount 0.9259341 -0.07321597 -0.20712055 -0.07474909 0.26435240
#> coupon_discount 0.6049229 0.72486811 -0.05684311 -0.30211298 -0.11789303
#> used_coupon 0.3008681 0.90824637 0.03714291 0.28358743 0.05227564
#> used_other_discount 0.9428689 -0.16422719 -0.20151272 0.01554260 0.06805377
#> count_trx 0.9323935 -0.24739076 -0.16870217 0.10755580 -0.12895465
As expected, our data now has 5 dimensions (summarized from the 7 numeric variables of our trx.sum data frame). From the dimension description above we can see some variables that define our first PC (Dim 1) significantly. Those variables are: selling_price (0.9326342), other_discount (0.9259341), used_other_discount (0.9428689), and count_trx (0.9323935).

Now you may wonder how these numbers describe our data. Let's see the visualization of our data distribution after reshaping it into a PCA object.
# Plotting preparation
# Create individual PCA plot
plot.ind <- plot.PCA(x = trx.pca,
                     choix = "ind",
                     select = "contrib 10" # showing the 10 most extreme outliers
                     )

# Create variable PCA plot
plot.var <- plot.PCA(trx.pca, choix = "varcor")

# Merge plots
plot_grid(plot.ind, plot.var, labels = c("A. Individual Plotting", "B. Variable Plotting"))

As we can see from the plots above, our data can now be distributed on only 2 axes (Dim 1 as the x axis and Dim 2 as the y axis). Notice that both plots have a zero line on each axis; these lines help us define which variables contribute to each PC and where the data points are distributed. (The direction an arrow points shows the PC to which that variable contributes most; in other words, if an arrow points along the x axis, parallel to the Dim 1 axis, the variable contributes to PC 1.)
Let's take an example from the dot Id-1555. Why is Id-1555 at that particular point? Pay attention to plot B, where the used_other_discount arrow is; from this position we can go back to plot A, where Id-1555 is. We can see that Id-1555's position is approximately between used_other_discount and selling_price. This means that:

- used_other_discount (together with selling_price and count_trx) contributes significantly to PC 1
- Id-1555's values of used_other_discount and selling_price are considered very high compared to the majority of the population

Let's compare it to another Id, say Id-6, which is not considered an extreme outlier, especially in the used_other_discount variable:
used_other_discount variable#Subset used_other_discount Id-1555 & Id-6 to check
trx.sum[c("Id-1555", "Id-6"), "used_other_discount"]#> [1] 2629 234
#Subset selling_price Id-1555 & Id-6 to check
trx.sum[c("Id-1555", "Id-6"), "selling_price"]#> [1] 518142.14 59508.22
trx.pca$var$coord number to each original value of a
variable from the data frame before we reshape it into PCA object.To get clearer visualization about the variable that significantly contributed to our PC we can see from below plot:
Note: we focus on PC 1 & PC 2, as these two PCs describe most of our data's information, and we're going to create a 2D plot to summarize the information in our data.
fviz_contrib(X = trx.pca,choice = "var",axes = 1, fill = "#a1d2d7")Although there are 7 variables that contributed to PC 1, we can say that variables that significantly contribute to PC 1 is :
used_other_discount,count_trx,selling_price,other_discount(exceed the red line)
fviz_contrib(X = trx.pca,choice = "var",axes = 2, fill = "#c8c7ef")Although there are 7 variables that contributed to PC 2, we can say that variables that significantly contribute to PC 2 is :
used_coupon&coupon_discount(the bar which exceeds the red line)
Now let’s get back to our PCA object:
trx.pca$eig

#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.32585160 61.7978800 61.79788
#> comp 2 1.50028802 21.4326860 83.23057
#> comp 3 0.72497875 10.3568393 93.58741
#> comp 4 0.19693457 2.8133509 96.40076
#> comp 5 0.14874395 2.1249135 98.52567
#> comp 6 0.08843461 1.2633515 99.78902
#> comp 7 0.01476851 0.2109787 100.00000
From this information we get a summary of how the variance accumulates. As you may remember, each PC is formed as a summary of our variables; trx.pca$eig accumulates the variance across the combined PCs. Notice our total number of variables is 7: if we used all 7 PCs we'd keep 100% of our information. But when we created our PCA we only kept 5 PCs, which means we keep 98.5% of the data's information. 98.5% is a pretty high number for the cumulative percentage of variance (remember, PCs summarize our variables' information). Again, how much information we keep is defined by our needs, or perhaps by a user request.
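The same eigenvalue table can be visualized as a scree plot; a small optional sketch using factoextra (already loaded above):

# Scree plot: percentage of variance explained per PC
fviz_eig(trx.pca, addlabels = TRUE)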
This PCA object can also help us identify outliers in our data frame. Let's check the individual plot that shows us the 10 most extreme outliers.
And the extreme outliers are:
ID : Id-22, Id-70, Id-1333, Id-367, Id-1574, Id-748, Id-1558, Id-464, Id-1555, Id-1011
plot.ind
These extreme outliers will be excluded from our clustering process (to avoid biasing the clustering).
K-means is a process that groups similar data points together to discover underlying patterns. A cluster refers to a collection of data points aggregated together because of certain similarities. We'll define a target number k, which refers to the number of centroids we need in the data set. A centroid is the imaginary or real location representing the center of a cluster.
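As a quick, purely illustrative sketch of how kmeans() behaves (toy data, not our transaction data): two well-separated groups, k = 2.

set.seed(42)
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
toy.km <- kmeans(toy, centers = 2)
toy.km$centers  # coordinates of the 2 centroids
toy.km$size     # number of points assigned to each cluster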
To create k-means clustering let’s remove our extreme outliers that appeared on previous PCA Individual plot :
extreme.outliers <- c("Id-22", "Id-70", "Id-1333", "Id-367", "Id-1574", "Id-748", "Id-1558", "Id-464", "Id-1555", "Id-1011")
trx.clean <- trx.sum[!rownames(trx.sum) %in% extreme.outliers, ]

After cleaning the 10 most extreme outliers from our data, we can define the optimal k (number of clusters) to apply to our data. The number of k can be defined according to business needs, or, in our case, with a statistical calculation using the elbow method.
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. As our objective is to group our customer into some similar characteristic, we can use elbow method to help us determine the optimal number of cluster.
We can use the elbow method with the fviz_nbclust() function.
fviz_nbclust(
  x = scale(trx.clean),
  FUNcluster = kmeans,
  method = "wss"
)

To determine the optimal number of clusters, we have to select the value of k at the "elbow", i.e. the point after which the distortion/inertia starts decreasing in a linear fashion. For the given data, we can conclude that the optimal number of clusters is 4. Notice that after k = 4, the linear pattern starts to appear.
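For intuition, here's a sketch of what the elbow method computes under the hood: the total within-cluster sum of squares (WSS) for each candidate k (the nstart value here is an arbitrary choice for this illustration):

# Compute total WSS for k = 1..10 and plot the elbow curve manually
set.seed(78)
wss <- sapply(1:10, function(k) {
  kmeans(scale(trx.clean), centers = k, nstart = 5)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")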
After we've decided the number of k (clusters) for our k-means model, let's create the k-means object and inspect the clustering result:
#Create k-means object
RNGkind(sample.kind = "Rounding")
set.seed(78)
trx.cluster <- kmeans(scale(trx.clean), 4)
#Check groups size
trx.cluster$size

#> [1] 451 124 920 77
As we've segmented our customers with the kmeans() function, we now have 4 types of customers based on their transaction behavior. From trx.cluster$size we get the total number of customers in each group: 451, 124, 920, and 77 customers in clusters 1 through 4 respectively.
Before we inspect the characteristics of each cluster, let's look at our model's performance.

$iter tells us how many iterations were taken until the algorithm found the best clustering for our data; in our case it took 4 iterations.

trx.cluster$iter

#> [1] 4
The WSS value shows us the distance of each observation to its centroid. The WSS values among our clusters do not differ greatly. Still, cluster 3 has the lowest WSS value, indicating that its observations are more similar to each other than those of the other 3 clusters.

trx.cluster$withinss

#> [1] 1048.1855 1133.0947 822.6827 1096.5568
The BSS value shows us the weighted sum of squared distances from each centroid to the global average. Our BSS value is considered high, meaning the clusters have widely differing characteristics. That's a good sign: each cluster is distinct from the others.

trx.cluster$betweenss

#> [1] 6896.48
BSS / TOTSS (Total Sum of Squares)

The BSS / TOTSS ratio shows how well each cluster captures the dominant characteristics of its members and how well the clustering describes our data distribution. The closer the value is to 1, the better our model. Our BSS/TOTSS value is about 0.63, which is quite good.

trx.cluster$betweenss /
trx.cluster$totss

#> [1] 0.6271238
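As an optional sanity check, a kmeans object's total sum of squares decomposes exactly into between-cluster plus total within-cluster sums of squares:

# totss = betweenss + tot.withinss for any kmeans result
trx.cluster$betweenss + trx.cluster$tot.withinss
trx.cluster$totss  # should print the same number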
After inspecting our model, we can proceed to our data clustering. In this section we're going to look at each cluster's characteristics.

Steps:

- Put the clustering result into the trx.clean data frame
- Group by cluster and summarize each variable by its average

# Put the clustering result into the trx.clean data frame
trx.clean$clustering <- as.factor(trx.cluster$cluster)

# See an example of the customer clustering result
trx.clean[1:6, 6:8]

# Group and summarize each cluster by its average variables
profile.average <- trx.clean %>%
  group_by(clustering) %>%
  summarise_all(mean) %>%
  select(clustering, used_coupon, coupon_discount, used_other_discount, count_trx, everything())
profile.average

Clustering Profile
1. Cluster 1

- Transaction count (count_trx) is the 2nd highest; we can assume that cluster 1 is active in transactions

Cluster 1 Profile Highlight: Considered a high transaction rate. There's a chance of using discounts even when the discount value is low.
2. Cluster 2

- Highest in purchased quantity (quantity) and spending (selling_price)

Cluster 2 Profile Highlight: Highest in counted transactions, quantity, and selling_price. Prefers to use other discounts. We can assume there's a tendency to use a coupon discount when the discount's offer value is high.
3. Cluster 3

trx.clean %>%
  filter(coupon_discount == 0) %>%
  select(clustering) %>%
  group_by(clustering) %>%
  table()

#> clustering
#> 1 2 3 4
#> 37 3 292 0

- From the coupon_discount variable, there are 292 customers in cluster 3 who never used a coupon even when a coupon discount was available. In other words, cluster 3 describes the group of customers who never or only rarely use a discount coupon

Cluster 3 Profile Highlight: Lowest rank in every aspect; the frequency of using discount coupons ranges from never to rarely.
4. Cluster 4

- count_trx & selling_price are close to cluster 1, but quantity is twice as high as cluster 1's

Cluster 4 Profile Highlight: High transaction rate. The most frequent user of coupon discounts. Interested in both coupon discounts and other discounts. Transaction behavior: purchases more than 1 item per transaction.
We can also get the same information from the visualization below:
scale.profile <- profile.average %>%
  mutate_if(is.numeric, scale)

scale.profile %>%
  pivot_longer(-clustering) %>%
  ggplot(aes(x = clustering, y = value, fill = clustering)) +
  geom_col() +
  facet_wrap(~name)

freq.clus <- trx.clean %>%
select(clustering) %>%
table() %>% as.data.frame()
ggplot(freq.clus, aes(x= clustering, y = Freq)) +
geom_col(fill = c("#5c88bd", "#bfe0c9", "#f8998f", "#bdaed1"))+
geom_text(aes(label = Freq, group = clustering), hjust = .5, vjust = -.5, fontface = "bold") +
theme_minimal() +
labs(x = "Cluster",
y = "Frequency",
title = "Cluster Frequency Distribution") +
theme(plot.title = element_text(size = 18, face = "bold", family = "serif"),
axis.title.y = element_text(size = 13, hjust = 0),
axis.text = element_text(size = 12))

From the analysis and the plot above we can conclude:
Conclusion
It's inefficient to use coupon discounts to increase our customers' transaction rate. Still, if a coupon discount campaign is needed, we can distribute the discount offers with a strategy tailored to each cluster.