First, we need to filter out rows with R_trans == Inf
and then scale the features to have zero mean and unit variance.
RFM_score_t_filtered <- RFM_score_t[RFM_score_t$R_trans != Inf,]
data_t <- RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")]
data_t_normalized <- scale(data_t)
kWe can determine the optimal number of clusters in clustering by 2 most popular methods: Elbow method and Silhouette Method.
We will first compute the total within-cluster sum of square,
wss, and then plot the Elbow method.
Result: The plot of Elbow method below indicates
that k = 3 is an optimal number of clusters.
wss <- sapply(1:10, function(k) kmeans(data_t_normalized, k)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-Cluster Sum of Squares")
We can use vectorized operations to compute the Silhouette scores for
different values of k. We will implement
fviz_nbclust function here to compute the average
Silhouette method without a for loop. To avoid running into
memory limitations due to the large size of the dataset
data_t_normalized, we will choose to sample only a subset
of the data (i.e. 10,000).
Result: The plot of Silhouette method below
indicates that k = 4 is an optimal number of clusters.
set.seed(0)
sample_data <- data_t_normalized[sample(nrow(data_t_normalized), 10000),]
fviz_nbclust(sample_data, FUN = kmeans, method = "silhouette")
As a compromise, we will pick k = 4 as the initial
optimal number of clusters to perform the following clustering
algorithms. The reason is fairly simple as 4 clusters may speak louder
about the large data structure than 3 clusters. Also, we will still
implement when k = 3 or when k = 5 for us to
compare within-cluster sum of squares (WCSS) later.
We will first implement k = 4.
set.seed(0)
kmeans_4 <- kmeans(data_t_normalized, centers = 4)
RFM_score_t_filtered$cluster <- as.factor(kmeans_4$cluster)
The 3D scatterplot with 3 dimensions of R,
F, and M in transaction has been plotted
below. The pattern is very clear:
For low M in transaction amount (roughly less than
$3,000 in M_trans), there are 3 clusters separated mainly
by R_trans.
For high M in transaction amount (roughly above than
$3,000 in M_trans), there is only 1 cluster that is
scattered everywhere with no particular patterns.
marker_size <- 3
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with K-Means (k = 4)")
We should still assess the quality of the clusters using WCSS. A
lower WCSS indicates that the observations within each cluster are
closer to the centroids of their respective clusters. Hence, under
comprehensive thinking, we will choose k = 5 since our
large data structure needs more clusters to clearly distinguish
different segments.
kmeans_3 <- kmeans(data_t_normalized, centers = 3)
wcss_3 <- kmeans_3$tot.withinss
wcss_3
## [1] 119561.6
wcss_4 <- kmeans_4$tot.withinss
wcss_4
## [1] 94453.18
kmeans_5 <- kmeans(data_t_normalized, centers = 5)
wcss_5 <- kmeans_5$tot.withinss
wcss_5
## [1] 77691.76
We will now implement k = 5 and visualize its
clusters.
set.seed(0)
kmeans_5 <- kmeans(data_t_normalized, centers = 5)
RFM_score_t_filtered$cluster <- as.factor(kmeans_5$cluster)
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with K-Means (k = 5)")
We will follow the subsequent points to interpret these 5 clusters in context of our data.
When k = 5, the 3D scatterplot with 3 dimensions of
R, F, and M in transaction has a
quite clear pattern as well:
For low M in transaction amount (roughly less than
$3,000 in M_trans), there are 3 clusters separated mainly
by R_trans.
For high M in transaction amount (roughly above than
$3,000 in M_trans), there are 2 clusters.
To summarize,
Cluster 1 has high M and high F.
R is more sporadic for cluster 1, though more data points
under cluster 1 tend to have small R (more recently
active).
Cluster 2 has relatively low M and low
F. R is something in the middle.
Cluster 3 has relatively low M (with a few
exceptions of high M), low R, and low
F.
Cluster 4 has very high M. However, the patterns for
R and F are scattered everywhere.
Cluster 5 has relatively low M (with a few
exceptions of high M), very high R (customers
who are very inactive), and relatively low F.
We will examine the centroids of each cluster, where the centroids represent the “average” data point in each cluster.
cluster_centroids <- kmeans_5$centers
cluster_centroids
## R_trans F_trans M_trans
## 1 -0.24681971 2.6570477 1.0124334
## 2 0.26101479 -0.1577517 -0.1401386
## 3 -0.83656214 -0.2046933 -0.1593184
## 4 -0.05512244 2.8036168 10.3983305
## 5 1.64950171 -0.1205759 0.0284700
Then we profile each cluster by examining additional summary
statistics of RFM scores within each cluster. 4 statistical measures
here include the average value of R_trans,
F_trans, M_trans, and the range of
R_trans.
RFM_score_t_filtered$cluster <- as.factor(kmeans_5$cluster)
cluster_profiles <- RFM_score_t_filtered |>
group_by(cluster) |>
summarise(
R_trans_mean = mean(R_trans),
F_trans_mean = mean(F_trans),
M_trans_mean = mean(M_trans),
R_trans_range = paste(min(R_trans), max(R_trans), sep = "-"),
)
cluster_profiles
## # A tibble: 5 × 5
## cluster R_trans_mean F_trans_mean M_trans_mean R_trans_range
## <fct> <dbl> <dbl> <dbl> <chr>
## 1 1 186. 3.69 1214. 0-743
## 2 2 291. 1.22 208. 166-442
## 3 3 65.4 1.18 191. 0-180
## 4 4 226. 3.82 9410. 1-731
## 5 5 576. 1.25 355. 342-760
Centroids: R_trans = -0.2468,
F_trans = 2.6570, M_trans = 1.0124
Profiles: R_trans: [0, 743],
Frequent transactions, High monetary value
Interpretation: This cluster appears to
represent “Champions” customers. They are engaged,
frequently transacting, and contribute significantly to revenue. They
may also have been recently active, although R_trans is
more sporadic.
Centroids: R_trans = 0.2610,
F_trans = -0.1578, M_trans = -0.1401
Profiles: R_trans: [166, 442], Less
frequent, Lower monetary value
Interpretation: This cluster might represent “About To Sleep” customers. They are less active, less frequent, and contribute less to revenue.
Centroids: R_trans = -0.8366,
F_trans = -0.2047, M_trans = -0.1593
Profiles: R_trans: [0, 180], Low
recency, Less frequent, Lower monetary value
Interpretation: This cluster might align with “New Customers” segments. They have shown recent activity but have low frequency and monetary value. They may be new to inKind and have potential for growth.
Centroids: R_trans = -0.0551,
F_trans = 2.8036, M_trans = 10.3983
Profiles: R_trans: [1, 731], Very
high monetary value, Frequent transactions
Interpretation: This unique cluster likely
represents a segment like “Super Champions” or
extremely high-value customers. They spend significantly more and
transact more frequently. R_trans is scattered, indicating
varied recency.
Centroids: R_trans = 1.6495,
F_trans = -0.1206, M_trans = 0.0285
Profiles: R_trans: [342, 760], Very
high recency, Less frequent, Moderate monetary value
Interpretation: This cluster might represent customers “At Risk”. They have not been active for a long time, have low frequency, and moderate monetary value.
These interpretations provide a qualitative understanding of the different customer behaviors captured by these 5 clusters.
In this sub-section, we will recreate the 3D Scatterplot with the
updated cluster labels for K-Means where k = 5.
RFM_score_t_filtered$cluster <- factor(RFM_score_t_filtered$cluster,
levels = 1:5,
labels = c("Champions",
"About To Sleep",
"New Customers",
"Super Champions",
"At Risk"))
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with K-Means (k = 5)")
kmeans_cluster_count <- table(RFM_score_t_filtered$cluster)
kmeans_cluster_count <- as.data.frame(kmeans_cluster_count)
colnames(kmeans_cluster_count) <- c("cluster", "count")
kmeans_cluster_count <- kmeans_cluster_count |>
mutate(percentage = round(count / sum(count) * 100, 2))
plot_ly(kmeans_cluster_count, ids = ~cluster, values = ~count, labels = ~cluster,
textinfo = "label+percent", insidetextorientation = "radial") |>
add_pie(hole = 0.6) |>
layout(showlegend = TRUE,
annotations = list(text = "K-Means Clustering \n (k = 5)",
font = list(size = 20),
showarrow = FALSE),
legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"))
Though K-Means clustering is doing a fairly good job, it assumes clusters as spherical and equally sized. It might struggle with clusters that have non-spherical or elliptical shapes. It provides hard cluster assignments, where each data point is assigned to a single cluster.
In our large-scale Bayesian probabilistic modeling work in NumPyro (upcoming in Statistical Modeling V), GMM’s flexibility and probabilistic nature could be advantageous for capturing intricate data patterns. In particular, GMM, as soft cluster assignments, provides a probabilistic interpretation of cluster assignments, indicating the likelihood of a data point belonging to a particular cluster. This accounts for uncertainty in assignments.
First, we scale these 3 dimensions so that they have zero mean and unit variance.
scaled_R_trans <- scale(RFM_score_t_filtered$R_trans)
scaled_F_trans <- scale(RFM_score_t_filtered$F_trans)
scaled_M_trans <- scale(RFM_score_t_filtered$M_trans)
scaled_RFM_score_t <- data.frame(R_trans = scaled_R_trans,
F_trans = scaled_F_trans,
M_trans = scaled_M_trans)
Then we want to measure the optimal number of components for GMM.
Since our dataset RFM_score_t_filtered is fairly large,
we consider reducing its dimensionality using PCA first.
RFM_score_t_filtered_scaled_pca <- prcomp(scaled_RFM_score_t[, c("R_trans", "F_trans", "M_trans")])
RFM_score_t_filtered_scaled_reduced <- RFM_score_t_filtered_scaled_pca$x[, 1:2]
We can implement BIC to find the optimal number of components. Lower
BIC values indicate better models. optimal_clusters
indicates that the optimal number of clusters is 3, which is roughly
aligned with our initialization in K-Means.
set.seed(0)
BIC_values <- mclustBIC(RFM_score_t_filtered_scaled_reduced)
which.min(BIC_values)
## [1] 3
Then we fit the GMM to RFM_score_t_filtered by using the
optimal number of clusters, G = 3.
set.seed(0)
RFM_gmm <- Mclust(RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")], G = 3)
RFM_score_t_filtered$cluster <- factor(RFM_gmm$classification)
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with GMM (G = 3)")
The summary of our fitted GMM with G = 3 is shown below.
We are interested in BIC and ICL (Integrated Complete-data
Likelihood).
summary(RFM_gmm)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -1070321 76734 25 -2140923 -2144209
##
## Clustering table:
## 1 2 3
## 48681 20797 7256
Let’s also take a look at our fitted GMM with G = 4.
set.seed(0)
RFM_gmm_4 <- Mclust(RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")], G = 4)
summary(RFM_gmm_4)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -1181834 76734 30 -2364005 -2392981
##
## Clustering table:
## 1 2 3 4
## 2480 21335 52564 355
In summary,
G = 3: BIC = -2,140,923, ICL = -2,144,209
G = 4: BIC = -2,364,005, ICL = -2,392,981
Since both the BIC and ICL values are lower for G = 4,
this suggests that the GMM with 4 clusters might provide a better fit to
the data. The ICL additionally takes into account the uncertainty of the
clustering, and since it also indicates that G = 4 is
better, it further supports this conclusion.
Therefore, let’s recreate the 3D Scatterplot for GMM with
G = 4.
RFM_score_t_filtered$cluster <- factor(RFM_gmm_4$classification)
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with GMM (G = 4)")
We will follow the subsequent points to interpret these 4 clusters in context of our data.
The pattern above is not very clear compared to K-Means Clustering
when k = 5. To summarize what I can read from the 3D
Scatterplot,
Cluster 1 has quite high F, meaning that they
purchase very frequently, while R and M
vary.
Cluster 2 has relatively low M and low
F. However, its range is spanning everywhere in
R.
Cluster 3 has low R and relatively low
M, meaning that they are most recently active customers but
purchase amount is not significantly high. Their F is
mostly not too high, though a few sporadic high F exist in
cluster 3.
Cluster 4 shares very similar structure as cluster 3, but it
seems that cluster 4 is even more extreme than cluster 3. In other
words, we can say that cluster 4 has no particular patterns. Their
R, F, and M ranges are much more
volatile.
Similar to K-Means Clustering, we profile each cluster by examining
additional summary statistics of RFM scores within each cluster. 4
statistical measures here include the average value of
R_trans, F_trans, M_trans, and
the range of R_trans.
cluster_profiles <- RFM_score_t_filtered |> group_by(cluster) |>
summarise(
R_trans_mean = mean(R_trans),
F_trans_mean = mean(F_trans),
M_trans_mean = mean(M_trans),
R_trans_range = paste(min(R_trans), max(R_trans), sep = "-"),
)
cluster_profiles
## # A tibble: 4 × 5
## cluster R_trans_mean F_trans_mean M_trans_mean R_trans_range
## <fct> <dbl> <dbl> <dbl> <chr>
## 1 1 156. 4.12 690. 0-730
## 2 2 517. 1.27 403. 1-760
## 3 3 127. 1.25 221. 0-664
## 4 4 232. 2.94 9492. 1-731
Profile: R_trans_mean: 155.9044,
F_trans_mean: 4.1157, M_trans_mean: 690.2089,
R_trans_range: [0, 730]
Interpretation: This cluster represents
customers who purchase very frequently (high F_trans). The
recent activity (R_trans) and monetary value
(M_trans) are varied but on average are not particularly
high or low. These customers may be loyal and engaged but not
necessarily
Profile: R_trans_mean: 517.0262,
F_trans_mean: 1.2655, M_trans_mean: 403.2855,
R_trans_range: [1, 760]
Interpretation: This cluster represents
customers who are less recently active (high R_trans) and
have low frequency (F_trans). Their spending
(M_trans) is moderate. These customers may be at risk of
churning or becoming inactive.
Profile: R_trans_mean: 127.3497,
F_trans_mean: 1.2523, M_trans_mean: 221.4865,
R_trans_range: [0, 664]
Interpretation: This cluster includes customers
who are recently active (low R_trans) but have relatively
low frequency (F_trans) and monetary value
(M_trans). These may be newer customers or those who are
engaged but spend less.
Profile: R_trans_mean: 231.7296,
F_trans_mean: 2.9380, M_trans_mean: 9491.8101,
R_trans_range: [1, 731]
Interpretation: This cluster has a very high
monetary value (M_trans), with scattered values for recent
activity (R_trans) and frequency (F_trans).
These customers may be high-value or big spenders but with inconsistent
engagement patterns.
RFM_score_t_filtered$cluster <- factor(RFM_gmm_4$classification,
levels = 1:4,
labels = c("Frequent Purchasers",
"Inactive Moderate Spenders",
"Recently Active Low Spenders",
"Volatile High Spenders"))
plot_ly(RFM_score_t_filtered,
x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
add_markers(marker = list(size = marker_size)) |>
layout(scene = list(xaxis = list(title = "R_trans"),
yaxis = list(title = "F_trans"),
zaxis = list(title = "M_trans")),
title = "3D Scatterplot for RFM with GMM (G = 4)")
gmm_cluster_count <- table(RFM_score_t_filtered$cluster)
gmm_cluster_count <- as.data.frame(gmm_cluster_count)
colnames(gmm_cluster_count) <- c("cluster", "count")
gmm_cluster_count <- gmm_cluster_count |>
mutate(percentage = round(count / sum(count) * 100, 2))
plot_ly(gmm_cluster_count, ids = ~cluster, values = ~count, labels = ~cluster,
textinfo = "label+percent", insidetextorientation = "radial") |>
add_pie(hole = 0.5) |>
layout(showlegend = TRUE,
annotations = list(text = "GMM Clustering \n (G = 4)",
font = list(size = 20),
showarrow = FALSE),
legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"))