1 RFM with K-Means Clustering

1.1 Data Normalization

First, we need to filter out rows with R_trans == Inf and then scale the features to have zero mean and unit variance.

RFM_score_t_filtered <- RFM_score_t[RFM_score_t$R_trans != Inf,]

data_t <- RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")]

data_t_normalized <- scale(data_t)

1.2 Determining Optimal Number of Clusters `k`

We can determine the optimal number of clusters in clustering by 2 most popular methods: Elbow method and Silhouette Method.

1.2.1 Elbow Method

We will first compute the total within-cluster sum of square, wss, and then plot the Elbow method.

Result: The plot of Elbow method below indicates that k = 3 is an optimal number of clusters.

wss <- sapply(1:10, function(k) kmeans(data_t_normalized, k)$tot.withinss)

plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-Cluster Sum of Squares")

1.2.2 Silhouette Method

We can use vectorized operations to compute the Silhouette scores for different values of k. We will implement fviz_nbclust function here to compute the average Silhouette method without a for loop. To avoid running into memory limitations due to the large size of the dataset data_t_normalized, we will choose to sample only a subset of the data (i.e. 10,000).

Result: The plot of Silhouette method below indicates that k = 4 is an optimal number of clusters.

set.seed(0)

sample_data <- data_t_normalized[sample(nrow(data_t_normalized), 10000),]

fviz_nbclust(sample_data, FUN = kmeans, method = "silhouette")

1.2.3 Initialization

As a compromise, we will pick k = 4 as the initial optimal number of clusters to perform the following clustering algorithms. The reason is fairly simple as 4 clusters may speak louder about the large data structure than 3 clusters. Also, we will still implement when k = 3 or when k = 5 for us to compare within-cluster sum of squares (WCSS) later.

1.3 K-Means Algorithm Implementation

We will first implement k = 4.

set.seed(0)

kmeans_4 <- kmeans(data_t_normalized, centers = 4) 

RFM_score_t_filtered$cluster <- as.factor(kmeans_4$cluster)

1.4 Cluster Visualization

The 3D scatterplot with 3 dimensions of R, F, and M in transaction has been plotted below. The pattern is very clear:

For low M in transaction amount (roughly less than $3,000 in M_trans), there are 3 clusters separated mainly by R_trans.
For high M in transaction amount (roughly above than $3,000 in M_trans), there is only 1 cluster that is scattered everywhere with no particular patterns.

marker_size <- 3

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with K-Means (k = 4)")

1.5 Cluster Evaluation

We should still assess the quality of the clusters using WCSS. A lower WCSS indicates that the observations within each cluster are closer to the centroids of their respective clusters. Hence, under comprehensive thinking, we will choose k = 5 since our large data structure needs more clusters to clearly distinguish different segments.

kmeans_3 <- kmeans(data_t_normalized, centers = 3) 

wcss_3 <- kmeans_3$tot.withinss

wcss_3

## [1] 119561.6

wcss_4 <- kmeans_4$tot.withinss

wcss_4

## [1] 94453.18

kmeans_5 <- kmeans(data_t_normalized, centers = 5) 

wcss_5 <- kmeans_5$tot.withinss

wcss_5

## [1] 77691.76

1.6 K-Means Algorithm Re-implementation

We will now implement k = 5 and visualize its clusters.

set.seed(0)

kmeans_5 <- kmeans(data_t_normalized, centers = 5) 

RFM_score_t_filtered$cluster <- as.factor(kmeans_5$cluster)

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with K-Means (k = 5)")

2 K-Means Cluster Interpretation

We will follow the subsequent points to interpret these 5 clusters in context of our data.

2.1 Graphic Verbal Interpretation

When k = 5, the 3D scatterplot with 3 dimensions of R, F, and M in transaction has a quite clear pattern as well:

For low M in transaction amount (roughly less than $3,000 in M_trans), there are 3 clusters separated mainly by R_trans.
For high M in transaction amount (roughly above than $3,000 in M_trans), there are 2 clusters.

To summarize,

Cluster 1 has high M and high F. R is more sporadic for cluster 1, though more data points under cluster 1 tend to have small R (more recently active).
Cluster 2 has relatively low M and low F. R is something in the middle.
Cluster 3 has relatively low M (with a few exceptions of high M), low R, and low F.
Cluster 4 has very high M. However, the patterns for R and F are scattered everywhere.
Cluster 5 has relatively low M (with a few exceptions of high M), very high R (customers who are very inactive), and relatively low F.

2.2 Cluster Centroids

We will examine the centroids of each cluster, where the centroids represent the “average” data point in each cluster.

cluster_centroids <- kmeans_5$centers

cluster_centroids

##       R_trans    F_trans    M_trans
## 1 -0.24681971  2.6570477  1.0124334
## 2  0.26101479 -0.1577517 -0.1401386
## 3 -0.83656214 -0.2046933 -0.1593184
## 4 -0.05512244  2.8036168 10.3983305
## 5  1.64950171 -0.1205759  0.0284700

2.3 Cluster Profiling

Then we profile each cluster by examining additional summary statistics of RFM scores within each cluster. 4 statistical measures here include the average value of R_trans, F_trans, M_trans, and the range of R_trans.

RFM_score_t_filtered$cluster <- as.factor(kmeans_5$cluster)

cluster_profiles <- RFM_score_t_filtered |>
  group_by(cluster) |>
  summarise(
    R_trans_mean = mean(R_trans),
    F_trans_mean = mean(F_trans),
    M_trans_mean = mean(M_trans),
    R_trans_range = paste(min(R_trans), max(R_trans), sep = "-"),
  )

cluster_profiles

## # A tibble: 5 × 5
##   cluster R_trans_mean F_trans_mean M_trans_mean R_trans_range
##   <fct>          <dbl>        <dbl>        <dbl> <chr>        
## 1 1              186.          3.69        1214. 0-743        
## 2 2              291.          1.22         208. 166-442      
## 3 3               65.4         1.18         191. 0-180        
## 4 4              226.          3.82        9410. 1-731        
## 5 5              576.          1.25         355. 342-760

2.4 Summary

2.4.1 Cluster 1: Champions

Centroids: R_trans = -0.2468, F_trans = 2.6570, M_trans = 1.0124
Profiles: R_trans: [0, 743], Frequent transactions, High monetary value
Interpretation: This cluster appears to represent “Champions” customers. They are engaged, frequently transacting, and contribute significantly to revenue. They may also have been recently active, although R_trans is more sporadic.

2.4.2 Cluster 2: About To Sleep

Centroids: R_trans = 0.2610, F_trans = -0.1578, M_trans = -0.1401
Profiles: R_trans: [166, 442], Less frequent, Lower monetary value
Interpretation: This cluster might represent “About To Sleep” customers. They are less active, less frequent, and contribute less to revenue.

2.4.3 Cluster 3: New Customers

Centroids: R_trans = -0.8366, F_trans = -0.2047, M_trans = -0.1593
Profiles: R_trans: [0, 180], Low recency, Less frequent, Lower monetary value
Interpretation: This cluster might align with “New Customers” segments. They have shown recent activity but have low frequency and monetary value. They may be new to inKind and have potential for growth.

2.4.4 Cluster 4: Super Champions

Centroids: R_trans = -0.0551, F_trans = 2.8036, M_trans = 10.3983
Profiles: R_trans: [1, 731], Very high monetary value, Frequent transactions
Interpretation: This unique cluster likely represents a segment like “Super Champions” or extremely high-value customers. They spend significantly more and transact more frequently. R_trans is scattered, indicating varied recency.

2.4.5 Cluster 5: At Risk

Centroids: R_trans = 1.6495, F_trans = -0.1206, M_trans = 0.0285
Profiles: R_trans: [342, 760], Very high recency, Less frequent, Moderate monetary value
Interpretation: This cluster might represent customers “At Risk”. They have not been active for a long time, have low frequency, and moderate monetary value.

These interpretations provide a qualitative understanding of the different customer behaviors captured by these 5 clusters.

2.5 Final Visualization with K-Means

In this sub-section, we will recreate the 3D Scatterplot with the updated cluster labels for K-Means where k = 5.

RFM_score_t_filtered$cluster <- factor(RFM_score_t_filtered$cluster,
                                       levels = 1:5,
                                       labels = c("Champions", 
                                                  "About To Sleep", 
                                                  "New Customers", 
                                                  "Super Champions", 
                                                  "At Risk"))

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with K-Means (k = 5)")

2.6 Distribution

kmeans_cluster_count <- table(RFM_score_t_filtered$cluster)

kmeans_cluster_count <- as.data.frame(kmeans_cluster_count)

colnames(kmeans_cluster_count) <- c("cluster", "count")

kmeans_cluster_count <- kmeans_cluster_count |>
  mutate(percentage = round(count / sum(count) * 100, 2))

plot_ly(kmeans_cluster_count, ids = ~cluster, values = ~count, labels = ~cluster, 
                textinfo = "label+percent", insidetextorientation = "radial") |>
  add_pie(hole = 0.6) |> 
  layout(showlegend = TRUE,
         annotations = list(text = "K-Means Clustering \n (k = 5)",
                            font = list(size = 20),
                            showarrow = FALSE),
         legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"))

3 RFM with Gaussian Mixture Model (GMM)

Though K-Means clustering is doing a fairly good job, it assumes clusters as spherical and equally sized. It might struggle with clusters that have non-spherical or elliptical shapes. It provides hard cluster assignments, where each data point is assigned to a single cluster.

In our large-scale Bayesian probabilistic modeling work in NumPyro (upcoming in Statistical Modeling V), GMM’s flexibility and probabilistic nature could be advantageous for capturing intricate data patterns. In particular, GMM, as soft cluster assignments, provides a probabilistic interpretation of cluster assignments, indicating the likelihood of a data point belonging to a particular cluster. This accounts for uncertainty in assignments.

3.1 Data Normalization

First, we scale these 3 dimensions so that they have zero mean and unit variance.

scaled_R_trans <- scale(RFM_score_t_filtered$R_trans)

scaled_F_trans <- scale(RFM_score_t_filtered$F_trans)

scaled_M_trans <- scale(RFM_score_t_filtered$M_trans)

scaled_RFM_score_t <- data.frame(R_trans = scaled_R_trans, 
                                 F_trans = scaled_F_trans, 
                                 M_trans = scaled_M_trans)

3.2 Determining Optimal Number of Components / Clusters

Then we want to measure the optimal number of components for GMM.

3.2.1 Dimensionality Reducation by Principal Component Analysis (PCA)

Since our dataset RFM_score_t_filtered is fairly large, we consider reducing its dimensionality using PCA first.

RFM_score_t_filtered_scaled_pca <- prcomp(scaled_RFM_score_t[, c("R_trans", "F_trans", "M_trans")])

RFM_score_t_filtered_scaled_reduced <- RFM_score_t_filtered_scaled_pca$x[, 1:2]

3.2.2 Optimal Number of Clusters by Bayesian Information Criterion (BIC)

We can implement BIC to find the optimal number of components. Lower BIC values indicate better models. optimal_clusters indicates that the optimal number of clusters is 3, which is roughly aligned with our initialization in K-Means.

set.seed(0)

BIC_values <- mclustBIC(RFM_score_t_filtered_scaled_reduced)

which.min(BIC_values)

## [1] 3

3.3 GMM Algorithm Implementation

Then we fit the GMM to RFM_score_t_filtered by using the optimal number of clusters, G = 3.

set.seed(0)

RFM_gmm <- Mclust(RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")], G = 3)

3.4 Cluster Visualization

RFM_score_t_filtered$cluster <- factor(RFM_gmm$classification)

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with GMM (G = 3)")

3.5 Cluster Evaluation

The summary of our fitted GMM with G = 3 is shown below. We are interested in BIC and ICL (Integrated Complete-data Likelihood).

summary(RFM_gmm)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood     n df      BIC      ICL
##        -1070321 76734 25 -2140923 -2144209
## 
## Clustering table:
##     1     2     3 
## 48681 20797  7256

Let’s also take a look at our fitted GMM with G = 4.

set.seed(0)

RFM_gmm_4 <- Mclust(RFM_score_t_filtered[, c("R_trans", "F_trans", "M_trans")], G = 4)

summary(RFM_gmm_4)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components: 
## 
##  log-likelihood     n df      BIC      ICL
##        -1181834 76734 30 -2364005 -2392981
## 
## Clustering table:
##     1     2     3     4 
##  2480 21335 52564   355

In summary,

G = 3: BIC = -2,140,923, ICL = -2,144,209
G = 4: BIC = -2,364,005, ICL = -2,392,981

Since both the BIC and ICL values are lower for G = 4, this suggests that the GMM with 4 clusters might provide a better fit to the data. The ICL additionally takes into account the uncertainty of the clustering, and since it also indicates that G = 4 is better, it further supports this conclusion.

Therefore, let’s recreate the 3D Scatterplot for GMM with G = 4.

3.6 GMM Algorithm Re-Implementation

RFM_score_t_filtered$cluster <- factor(RFM_gmm_4$classification)

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with GMM (G = 4)")

4 GMM Cluster Interpretation

We will follow the subsequent points to interpret these 4 clusters in context of our data.

4.1 Graphic Verbal Interpretation

The pattern above is not very clear compared to K-Means Clustering when k = 5. To summarize what I can read from the 3D Scatterplot,

Cluster 1 has quite high F, meaning that they purchase very frequently, while R and M vary.
Cluster 2 has relatively low M and low F. However, its range is spanning everywhere in R.
Cluster 3 has low R and relatively low M, meaning that they are most recently active customers but purchase amount is not significantly high. Their F is mostly not too high, though a few sporadic high F exist in cluster 3.
Cluster 4 shares very similar structure as cluster 3, but it seems that cluster 4 is even more extreme than cluster 3. In other words, we can say that cluster 4 has no particular patterns. Their R, F, and M ranges are much more volatile.

4.2 Cluster Profiling

Similar to K-Means Clustering, we profile each cluster by examining additional summary statistics of RFM scores within each cluster. 4 statistical measures here include the average value of R_trans, F_trans, M_trans, and the range of R_trans.

cluster_profiles <- RFM_score_t_filtered |> group_by(cluster) |>
  summarise(
    R_trans_mean = mean(R_trans),
    F_trans_mean = mean(F_trans),
    M_trans_mean = mean(M_trans),
    R_trans_range = paste(min(R_trans), max(R_trans), sep = "-"),
  )

cluster_profiles

## # A tibble: 4 × 5
##   cluster R_trans_mean F_trans_mean M_trans_mean R_trans_range
##   <fct>          <dbl>        <dbl>        <dbl> <chr>        
## 1 1               156.         4.12         690. 0-730        
## 2 2               517.         1.27         403. 1-760        
## 3 3               127.         1.25         221. 0-664        
## 4 4               232.         2.94        9492. 1-731

4.3 Summary

4.3.1 Cluster 1: Frequent Purchasers

Profile: R_trans_mean: 155.9044, F_trans_mean: 4.1157, M_trans_mean: 690.2089, R_trans_range: [0, 730]
Interpretation: This cluster represents customers who purchase very frequently (high F_trans). The recent activity (R_trans) and monetary value (M_trans) are varied but on average are not particularly high or low. These customers may be loyal and engaged but not necessarily

4.3.2 Cluster 2: Inactive Moderate Spenders

Profile: R_trans_mean: 517.0262, F_trans_mean: 1.2655, M_trans_mean: 403.2855, R_trans_range: [1, 760]
Interpretation: This cluster represents customers who are less recently active (high R_trans) and have low frequency (F_trans). Their spending (M_trans) is moderate. These customers may be at risk of churning or becoming inactive.

4.3.3 Cluster 3: Recently Active Low Spenders

Profile: R_trans_mean: 127.3497, F_trans_mean: 1.2523, M_trans_mean: 221.4865, R_trans_range: [0, 664]
Interpretation: This cluster includes customers who are recently active (low R_trans) but have relatively low frequency (F_trans) and monetary value (M_trans). These may be newer customers or those who are engaged but spend less.

4.3.4 Cluster 4: Volatile High Spenders

Profile: R_trans_mean: 231.7296, F_trans_mean: 2.9380, M_trans_mean: 9491.8101, R_trans_range: [1, 731]
Interpretation: This cluster has a very high monetary value (M_trans), with scattered values for recent activity (R_trans) and frequency (F_trans). These customers may be high-value or big spenders but with inconsistent engagement patterns.

4.4 Final Visualization with GMM

RFM_score_t_filtered$cluster <- factor(RFM_gmm_4$classification,
                                       levels = 1:4,
                                       labels = c("Frequent Purchasers",
                                                  "Inactive Moderate Spenders",
                                                  "Recently Active Low Spenders",
                                                  "Volatile High Spenders"))

plot_ly(RFM_score_t_filtered, 
        x = ~R_trans, y = ~F_trans, z = ~M_trans, color = ~cluster) |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM with GMM (G = 4)")

4.5 Distribution

gmm_cluster_count <- table(RFM_score_t_filtered$cluster)

gmm_cluster_count <- as.data.frame(gmm_cluster_count)

colnames(gmm_cluster_count) <- c("cluster", "count")

gmm_cluster_count <- gmm_cluster_count |>
  mutate(percentage = round(count / sum(count) * 100, 2))

plot_ly(gmm_cluster_count, ids = ~cluster, values = ~count, labels = ~cluster, 
                textinfo = "label+percent", insidetextorientation = "radial") |>
  add_pie(hole = 0.5) |> 
  layout(showlegend = TRUE,
         annotations = list(text = "GMM Clustering \n (G = 4)",
                            font = list(size = 20),
                            showarrow = FALSE),
         legend = list(orientation = "h", x = 0.5, y = -0.1, xanchor = "center"))

Statistical Modeling IV - Unsupervised Learning for RFM Clustering with K-Means and Gaussian Mixture Modeling

Zhongming Jiang

2023-08-16