1 Panel View

Let’s first take a look at the structure of customer/project pair over time in the unit of bi-week.

imputed_PDF_biweek_indiv_clean$account_created_at <- 
  as.Date(imputed_PDF_biweek_indiv_clean$account_created_at)

imputed_PDF_biweek_indiv_clean <- 
  imputed_PDF_biweek_indiv_clean |> group_by(user_project_pair) |>
  mutate(min_biweek = ifelse(any(observed == 1), min(biweek_i_indiv[observed == 1], na.rm = TRUE), NA),
         max_biweek = ifelse(any(observed == 1), max(biweek_i_indiv[observed == 1], na.rm = TRUE), NA),
         active_period = ifelse(biweek_i_indiv >= min_biweek & biweek_i_indiv <= max_biweek, 1, observed),
         account_created_min = min(account_created_at, na.rm = TRUE)) |> ungroup() |> 
  arrange(account_created_min, user_project_pair, biweek_i_indiv) |> 
  select(-min_biweek, -max_biweek, -account_created_min) 

imputed_PDF_biweek_indiv_clean_seg <- imputed_PDF_biweek_indiv_clean[, c(1:2, 53, 55)]

The following time-series plot of self-pacing transaction and redemption history is updated from Statistical Modeling II [3] by employing the new definition of active_period.

Notations:

  • 1: obs_TR (observed data with transaction/redemption recorded)
  • 0: obs_nTR (observed data with transaction/redemption NOT recorded)
  • -1: mis (missing data, mainly for counterfactual inference if it hasn’t happened yet)
panelview(D = "active_period", 
          data = imputed_PDF_biweek_indiv_clean_seg, 
          index = c("user_project_pair", "biweek_i_indiv"), 
          display.all = TRUE, type = "treat", by.timing = TRUE,
          xlab = "Biweek Index", ylab = "User Project Pair",
          axis.lab.gap = c(2,10), main = "Self-Pacing Transaction & Redemption History")
## If the number of units is more than 300, we set "gridOff = TRUE".
## 3 treatment levels.

2 Frequency View

In this section, we create obs_TR_count_freq to record the number of frequency for different counts of transaction and redemption (in short, TR).

obs_TR <- imputed_PDF_biweek_indiv_clean[imputed_PDF_biweek_indiv_clean$observed == 1,]

obs_TR_count <- table(obs_TR$user_project_pair)

obs_TR_count_freq <- table(obs_TR_count)

obs_TR_count_freq <- as.data.frame(obs_TR_count_freq)

colnames(obs_TR_count_freq) <- c("obs_count", "freq_obs_count")
ggplot(obs_TR_count_freq, aes(x = obs_count, y = freq_obs_count)) +
  geom_bar(stat = "identity", fill = "steelblue2", color = "black") +
  labs(title = "Frequency of Count of TR Made Within 1 Bi-week",
       x = "Count of TR",
       y = "Frequency") +
  geom_text(aes(label = freq_obs_count), vjust = -0.5, size = 3) +
  theme_minimal()

3 Subsetting Technique for Count of TR

The function obs_TR_pair_k takes input k to return a vector of customer/project IDs with count of TR equal to k. This reduces the time for us to implement subsetting examples.

obs_TR_pair_k <- function(k) {

  obs_TR_count_k <- as.vector(names(obs_TR_count[obs_TR_count == k]))

  return(obs_TR_count_k)
  
}

3.1 Example: Count of Observed TR > 20

The following example is by subsetting customer/project IDs active periods over 20. A panel view is provided as usual.

obs_TR_count_gt20 <- as.vector(names(obs_TR_count[obs_TR_count > 20]))

length(obs_TR_count_gt20)
## [1] 365
model_PDF_TRgt20 <- 
  imputed_PDF_biweek_indiv_clean[imputed_PDF_biweek_indiv_clean$user_project_pair %in% obs_TR_count_gt20, ]

dim(model_PDF_TRgt20)
## [1] 20075    55
panelview(D = "active_period", 
          data = model_PDF_TRgt20, 
          index = c("user_project_pair", "biweek_i_indiv"), 
          display.all = TRUE, type = "treat", by.timing = TRUE,
          xlab = "Biweek Index", ylab = "User Project Pair",
          axis.lab.gap = c(2,10), main = "Self-Pacing TR History (active periods > 20)")
## If the number of units is more than 300, we set "gridOff = TRUE".
## 3 treatment levels.

3.2 Example Continued: Validation

For validation purpose, we will hide 20% of data and see if we can recover it using 80% of existing data.

model_PDF_TRgt20 <- model_PDF_TRgt20[, -c(8, 10, 51:52, 54)]

model_PDF_TRgt20_imputed <- model_PDF_TRgt20

model_PDF_TRgt20_imputed[, 4:10] <- na.locf(model_PDF_TRgt20_imputed[, 4:10])

model_PDF_TRgt20_imputed[, 11:48][is.na(model_PDF_TRgt20_imputed[, 11:48])] <- 0
model_PDF_TRgt20_imputed$hide <- 0

model_PDF_TRgt20_imputed$hide[model_PDF_TRgt20_imputed$biweek_i_indiv >= 44 & 
                              model_PDF_TRgt20_imputed$biweek_i_indiv <= 54] <- 1

model_PDF_TRgt20_imputed_train <- 
  model_PDF_TRgt20_imputed[model_PDF_TRgt20_imputed$hide == 0,]

model_PDF_TRgt20_imputed_test <- 
  model_PDF_TRgt20_imputed[model_PDF_TRgt20_imputed$hide == 1,]

dim(model_PDF_TRgt20_imputed_train)
## [1] 16060    51
dim(model_PDF_TRgt20_imputed_test)
## [1] 4015   51
panelview(D = "active_period", 
          data = model_PDF_TRgt20_imputed_train, 
          index = c("user_project_pair", "biweek_i_indiv"), 
          display.all = TRUE, type = "treat", by.timing = TRUE,
          xlab = "Biweek Index", ylab = "User Project Pair",
          axis.lab.gap = c(2,10), main = "Self-Pacing TR History (active periods > 20, training data)")
## If the number of units is more than 300, we set "gridOff = TRUE".
## 3 treatment levels.

3.3 Example Continued: Binary Treatment

This sub-section is to take 3 treatment levels above into binary treatment, where 1 and 0 above is now under control, and -1 is now under treatment.

model_PDF_TRgt20_imputed_train$D_i <- 
  ifelse(model_PDF_TRgt20_imputed_train$observed == -1, 1, 0)

model_PDF_TRgt20_imputed_train$treat <- ave(model_PDF_TRgt20_imputed_train$D_i, 
                                            model_PDF_TRgt20_imputed_train$user_project_pair, 
                                            FUN = function(x) as.integer(any(x == 1)))
panelview(D = "D_i", 
          data = model_PDF_TRgt20_imputed_train, 
          index = c("user_project_pair", "biweek_i_indiv"), 
          display.all = TRUE, type = "treat", by.timing = TRUE,
          xlab = "Biweek Index", ylab = "User Project Pair",
          axis.lab.gap = c(2,10), main = "Self-Pacing TR History (active periods > 20, binary treatment training data)")
## If the number of units is more than 300, we set "gridOff = TRUE".

Unfortunately, our data is not suitable for directly implementing bpCausal since their software does not accept both staggered adoption (where different customers adopt purchase or redemption at different times) and staggered treatment assignment (where different customers receive the treatment at different times). Instead of rearranging the data to use their model, we aim to create our own model to directly resolve this problem. Such an issue will be discussed in Statistical Modeling V in a .ipynb document.

Hence, the concentration of Statistical Modeling III is focusing on customer segmentation under supervised learning. Due to heavy computations, unsupervised clustering methods will be discussed separately in Statistical Modeling IV.

4 Rearranging Panel Data Structure

We will rearrange the panel data structure of PDF that is tailored to work perfectly in our designed program. First, we rename several columns below for easy reference.

names(PDF)[names(PDF) == "user_project_pair"] <- "pair"

names(PDF)[names(PDF) == "total_redemption_amount"] <- "redem"

names(PDF)[names(PDF) == "amount_charged_in_usd"] <- "charged"

names(PDF)[names(PDF) == "credit_given_in_usd"] <- "trans"

names(PDF)[names(PDF) == "cumsum_total_redemption_amount"] <- "cumsum_redem"

names(PDF)[names(PDF) == "cumsum_amount_charged_in_usd"] <- "cumsum_charged"

names(PDF)[names(PDF) == "cumsum_credit_given_in_usd"] <- "cumsum_trans"

4.1 Recency metrics

The recency pattern discovered in EDA III [1] and IV [2] is an important indicator, so we will display such information in our modeling data, PDF. The detailed steps here are:

  1. Define last_date as the maximum value of the created_at column in PDF.

  2. Create 3 new columns, recency_trans, recency_redem, and recency_charged, in PDF.

  3. If redem is not equal to 0, calculate the difference between created_at and last_date and record the difference in recency_redem. If redem is 0, record Inf in recency_redem.

  4. Similarly, if trans is not equal to 0, calculate the difference between created_at and last_date and record the difference in recency_trans. If trans is 0, record Inf in recency_trans.

  5. Similarly, if charged is not equal to 0, calculate the difference between created_at and last_date and record the difference in recency_charged. If charged is 0, record Inf in recency_charged.

last_date <- max(PDF$created_at)

PDF$recency_trans <- round(ifelse(PDF$trans != 0, 
                                  as.numeric(difftime(last_date, PDF$created_at, units = "days")), Inf))

PDF$recency_redem <- round(ifelse(PDF$redem != 0, 
                                  as.numeric(difftime(last_date, PDF$created_at, units = "days")), Inf))

PDF$recency_charged <- round(ifelse(PDF$charged != 0, 
                                    as.numeric(difftime(last_date, PDF$created_at, units = "days")), Inf))

head(PDF[, c(3,4, 80:82)], 10)
##          pair created_at recency_trans recency_redem recency_charged
## 1  100001-695 2022-04-22           284           Inf             284
## 2  100002-676 2022-04-22           Inf           284             Inf
## 3  100002-676 2022-04-23           283           Inf             283
## 4  100003-185 2022-04-22           Inf           284             Inf
## 5  100003-185 2022-04-22           284           Inf             284
## 6  100003-185 2022-04-23           283           Inf             283
## 7  100005-692 2022-04-22           Inf           284             Inf
## 8  100005-692 2023-01-09           Inf            22             Inf
## 9  100005-692 2023-01-09            22           Inf              22
## 10 100007-695 2022-04-23           283           Inf             283

4.2 Merging summarization and additional covariates

This sub-section will begin the merging summary for PDF, taking essential covariates for our consequent analyses. Notice that PDF_cov is created as a simplified copy of PDF, and we rename the time index biweek_i_indiv as t for simplicity.

PDF_cov <- PDF[, c(3, 6, 7:9, 11, 12:14, 16, 80:82, 17, 22:61, 76, 62:66, 78:79)]

names(PDF_cov)[names(PDF_cov) == "biweek_i_indiv"] <- "t"

Our merging summary data frame is called PDF_JA, which refers to the panel data frame with jagged arrays. The expanding covariates are case-dependent:

  1. For numeric columns, such as redem, charged, trans, and tip, their average amount (avg_redem, etc.) and frequency amount is created (freq_redem, etc.).

  2. For cumulative numeric columns, such as cumsum_redem, cumsum_charged, cumsum_trans, and cumsum_tip, their maximum amount is returned (max_cumsum_trans, etc.).

  3. For recency related columns, such as recency_redem, recency_trans, and recency_charged, their minimum amount is returned (min_recency_redem, etc.).

  4. For character types of data and indicator variables like is_reconcilable, suspicious_promo_binary, and utm_binary, since each pair shares same information across them, these inputs are returned with no changes.

  5. For expanding factor types, such as Venue.Type...Detail_Bar, their frequency amount is returned (freq_Venue.Type...Detail_Bar).

PDF_JA <- PDF_cov |>
  group_by(pair, t) |>
  summarise(
    across(1:4,
           list(avg = ~round(mean(.), 2),
                freq = ~sum(. != 0 & !is.na(.))),
           .names = "{.fn}_{.col}"),
    across(5:8,
           list(max = max),
           .names = "{.fn}_{.col}"),
    across(9:11,
           list(min = min),
           .names = "{.fn}_{.col}"),
    across(12:14, ~first(.), .names = "{.col}"),
    across(15:52,
           ~sum(. != 0 & !is.na(.)),
           .names = "freq_{.col}"),
    across(53:60, ~first(.), .names = "{.col}")
  ) |>
  ungroup()
## `summarise()` has grouped output by 'pair'. You can override using the
## `.groups` argument.
colnames(PDF_JA)
##  [1] "pair"                                       
##  [2] "t"                                          
##  [3] "avg_redem"                                  
##  [4] "freq_redem"                                 
##  [5] "avg_charged"                                
##  [6] "freq_charged"                               
##  [7] "avg_trans"                                  
##  [8] "freq_trans"                                 
##  [9] "avg_tip"                                    
## [10] "freq_tip"                                   
## [11] "max_cumsum_redem"                           
## [12] "max_cumsum_charged"                         
## [13] "max_cumsum_trans"                           
## [14] "max_cumsum_tip"                             
## [15] "min_recency_trans"                          
## [16] "min_recency_redem"                          
## [17] "min_recency_charged"                        
## [18] "is_reconcilable"                            
## [19] "suspicious_promo_binary"                    
## [20] "utm_binary"                                 
## [21] "freq_Venue.Type...Detail_Bar"               
## [22] "freq_Venue.Type...Detail_Cafe"              
## [23] "freq_Venue.Type...Detail_Casual.Dining"     
## [24] "freq_Venue.Type...Detail_Casual.Fine.Dining"
## [25] "freq_Venue.Type...Detail_Coffee.Shop"       
## [26] "freq_Venue.Type...Detail_Fast.Casual"       
## [27] "freq_Venue.Type...Detail_Fast.Food"         
## [28] "freq_Venue.Type...Detail_Fine.Dining"       
## [29] "freq_Venue.Type...Detail_Market"            
## [30] "freq_Venue.Type...Detail_Nightclub"         
## [31] "freq_Venue.Type...Detail_Unknown"           
## [32] "freq_Check.Average_High"                    
## [33] "freq_Check.Average_Low"                     
## [34] "freq_Check.Average_Mid"                     
## [35] "freq_Check.Average_Very.High"               
## [36] "freq_Check.Average_Very.Low"                
## [37] "freq_Check.Average_Missing"                 
## [38] "freq_Service.Type_Fast.Casual"              
## [39] "freq_Service.Type_Full.Service"             
## [40] "freq_Service.Type_Ghost.Kitchen"            
## [41] "freq_Service.Type_QSR"                      
## [42] "freq_Service.Type_Unknown"                  
## [43] "freq_is_app_purchase_0"                     
## [44] "freq_is_app_purchase_1"                     
## [45] "freq_is_app_purchase_Missing"               
## [46] "freq_transaction_type_0"                    
## [47] "freq_transaction_type_1"                    
## [48] "freq_transaction_type_Missing"              
## [49] "freq_credit_type_0"                         
## [50] "freq_credit_type_5"                         
## [51] "freq_credit_type_Missing"                   
## [52] "freq_is_excess_0"                           
## [53] "freq_is_excess_1"                           
## [54] "freq_stripe_brand_American.Express"         
## [55] "freq_stripe_brand_Discover"                 
## [56] "freq_stripe_brand_MasterCard"               
## [57] "freq_stripe_brand_Visa"                     
## [58] "freq_stripe_brand_Other"                    
## [59] "account_created_at"                         
## [60] "project_name"                               
## [61] "Location"                                   
## [62] "project_location_id"                        
## [63] "city"                                       
## [64] "state"                                      
## [65] "user_app_version"                           
## [66] "user_app_platform"

5 Customer Segmentation - Supervised Learning with RFM

The RFM scoring system [4] is based on the following principles:

Recency (R): Represents the freshness of customer activity. A lower recency value indicates more recent activity, so a lower recency value corresponds to a higher score in R dimension. This reflects the fact that more recently active customers are often more engaged with the transaction.

Frequency (F): Reflects how often a customer purchases. A higher frequency value should correspond to a higher score in F dimension, as more frequent customers are typically more loyal.

Monetary (M): Represents the total revenue generated by the customer. A higher monetary value should correspond to a higher score in M dimension, as customers who spend more are generally more valuable to the business.

10 customer segmentations (excluding the Unknown type) are defined below with various levels for R, F, and M. This segmentation approach is inspired by RFM - Customer Level Data (https://rfm.rsquaredacademy.com/articles/rfm-customer-level-data.html) [4], though I edit the inner structure here a bit to better suit the data RFM. Notice that each of our F and M includes 3 sub-components, transaction, redemption, and charged amount.

segments <- list(
  "Champions" = c(5, 5, 5),
  "Potential Loyalist" = list(3:5, 3:5, 2:5),
  "Loyal Customers" = list(2:4, 2:4, 2:4),
  "Promising" = list(3:4, 1:3, 3:5),
  "New Customers" = list(4:5, 1:3, 1:5),
  "Can't Lose Them" = list(1:2, 3:4, 4:5),
  "At Risk" = list(1:2, 2:5, 4:5),
  "Need Attention" = list(1:3, 3:5, 3:5),
  "About To Sleep" = list(2:3, 1:3, 1:4),
  "Lost" = list(1, 1:5, 1:5)
)

assign_segments <- function(r_score, f_score, m_score) {
  
  for (segment in names(segments)) {
    
    r <- segments[[segment]][[1]]
    
    f <- segments[[segment]][[2]]
    
    m <- segments[[segment]][[3]]
    
    if (r_score %in% r & f_score %in% f & m_score %in% m) {
      
      return(segment)
      
    }
    
  }
  
  return("Unknown")
  
}

RFM <- PDF_JA |> group_by(pair) |>
  summarise(
    R_redem = min(min_recency_redem, na.rm = TRUE),
    R_trans = min(min_recency_trans, na.rm = TRUE),
    R_charged = min(min_recency_charged, na.rm = TRUE),
    F_redem = sum(freq_redem, na.rm = TRUE),
    F_trans = sum(freq_trans, na.rm = TRUE),
    F_charged = sum(freq_charged, na.rm = TRUE),
    M_redem = max(max_cumsum_redem, na.rm = TRUE),
    M_trans = max(max_cumsum_trans, na.rm = TRUE),
    M_charged = max(max_cumsum_charged, na.rm = TRUE)
    )

head(RFM)
## # A tibble: 6 × 10
##   pair       R_redem R_trans R_charged F_redem F_trans F_charged M_redem M_trans
##   <chr>        <dbl>   <dbl>     <dbl>   <int>   <int>     <int>   <dbl>   <dbl>
## 1 100001-695     Inf     284       284       0       1         1     0      48.6
## 2 100002-676     284     283       283       1       1         1   126.      1.5
## 3 100003-185     284     283       283       1       2         2   292.    292. 
## 4 100005-692      22      22        22       2       1         1   118      18  
## 5 100007-695     Inf     283       283       0       1         1     0      26.3
## 6 100008-692     284     283       283       1       1         1    19.5    19.5
## # ℹ 1 more variable: M_charged <dbl>

5.1 RFM with only transaction

Let’s first implement R_trans, F_trans, and M_trans for dimensions of R, F, and M. This is considered to be the most standard format of RFM model since it is only based on transaction history.

The output here is RFM_score_t.

RFM_score_t <- RFM |>
  mutate(
    R_score = ifelse(R_trans == Inf, 1, 6 - ntile(R_trans, 5)), 
    F_score = ntile(F_trans, 5),
    M_score = ntile(M_trans, 5),
    RFM_score = paste(R_score, F_score, M_score, sep = "")
  )

RFM_score_t$segment <- mapply(assign_segments, 
                              RFM_score_t$R_score, 
                              RFM_score_t$F_score, 
                              RFM_score_t$M_score)

RFM_score_t[1:10, c(1, 11:15)]
## # A tibble: 10 × 6
##    pair       R_score F_score M_score RFM_score segment           
##    <chr>        <dbl>   <int>   <int> <chr>     <chr>             
##  1 100001-695       3       2       2 322       Loyal Customers   
##  2 100002-676       3       2       2 322       Loyal Customers   
##  3 100003-185       3       5       4 354       Potential Loyalist
##  4 100005-692       5       2       2 522       New Customers     
##  5 100007-695       3       2       2 322       Loyal Customers   
##  6 100008-692       3       2       2 322       Loyal Customers   
##  7 100009-720       3       5       4 354       Potential Loyalist
##  8 100012-702       3       2       4 324       Loyal Customers   
##  9 100013-697       3       5       4 354       Potential Loyalist
## 10 100014-706       3       2       3 323       Loyal Customers

5.2 RFM with transaction, redemption, and charged amount

Let’s then implement a combination of R_trans, R_redem, and R_charged, a combination of F_trans, F_redem, and F_charged AND a combination of M_trans, M_redem, and M_charged for dimensions of R, F, and M. The logic implemented here is followed by:

  1. Recency (R): Since we have 3 sub-components (R_redem, R_trans, R_charged), we’ll consider the minimum of these 3 as the recency value. If any of them contains Inf, we’ll automatically mark the lowest ranking for R dimension.

  2. Frequency (F): We’ll sum the sub-components (F_redem, F_trans, F_charged) to get the frequency value.

  3. Monetary (M): Similarly, we’ll sum the sub-components (M_redem, M_trans, M_charged) to get the monetary value.

The output here is RFM_score.

RFM_score <- RFM |>
  mutate(
    R = pmin(R_redem, R_trans, R_charged, na.rm = TRUE),
    F_sum = F_redem + F_trans + F_charged,
    M_sum = M_redem + M_trans + M_charged
  ) |>
  mutate(
    R_score = ifelse(R == Inf, 1, 6 - ntile(R, 5)),
    F_score = ntile(F_sum, 5),
    M_score = ntile(M_sum, 5),
    RFM_score = paste(R_score, F_score, M_score, sep = "")
  )


RFM_score$segment <- mapply(assign_segments, 
                            RFM_score$R_score, 
                            RFM_score$F_score, 
                            RFM_score$M_score)

RFM_score[1:10, c(1, 14:18)]
## # A tibble: 10 × 6
##    pair       R_score F_score M_score RFM_score segment           
##    <chr>        <dbl>   <int>   <int> <chr>     <chr>             
##  1 100001-695       2       1       2 212       About To Sleep    
##  2 100002-676       2       2       2 222       Loyal Customers   
##  3 100003-185       2       4       5 245       Can't Lose Them   
##  4 100005-692       5       4       2 542       Potential Loyalist
##  5 100007-695       2       1       1 211       About To Sleep    
##  6 100008-692       2       2       1 221       About To Sleep    
##  7 100009-720       2       4       4 244       Loyal Customers   
##  8 100012-702       2       2       4 224       Loyal Customers   
##  9 100013-697       2       5       4 254       At Risk           
## 10 100014-706       2       2       3 223       Loyal Customers

5.3 Visualization: RFM with only transaction

colors <- brewer.pal(10, "Set3")

segment_descriptions <- c(
  "Champions" = "Champions (R: 5, F: 5, M: 5)",
  "Potential Loyalist" = "Potential Loyalist (R: 3-5, F: 3-5, M: 2-5)",
  "Loyal Customers" = "Loyal Customers (R: 2-4, F: 2-4, M: 2-4)",
  "Promising" = "Promising (R: 3-4, F: 1-3, M: 3-5)",
  "New Customers" = "New Customers (R: 4-5, F: 1-3, M: 1-5)",
  "Can't Lose Them" = "Can't Lose Them (R: 1-2, F: 3-4, M: 4-5)",
  "At Risk" = "At Risk (R: 1-2, F: 2-5, M: 4-5)",
  "Need Attention" = "Need Attention (R: 1-3, F: 3-5, M: 3-5)",
  "About To Sleep" = "About To Sleep (R: 2-3, F: 1-3, M: 1-4)",
  "Lost" = "Lost (R: 1, F: 1-5, M: 1-5)",
  "Unknown" = "Unknown"
)
ggplot(RFM_score_t, aes(x = segment, fill = segment_descriptions[segment])) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3", na.value = "gray", name = "Segment Description") +
  ggtitle("Segment Distribution for RFM_score_t") +
  xlab("Segment") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "right")

marker_size <- 3

plot_ly(RFM_score_t, x = ~R_trans, y = ~F_trans, z = ~M_trans, 
        color = ~segment_descriptions[segment], colors = "Set3") |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM_score_t")

5.4 Visualization: RFM with transaction, redemption, and charged amount

ggplot(RFM_score, aes(x = segment, fill = segment_descriptions[segment])) +
  geom_bar() +
  scale_fill_brewer(palette = "Set3", na.value = "gray", name = "Segment Description") +
  ggtitle("Segment Distribution for RFM_score") +
  xlab("Segment") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "right")

plot_ly(RFM_score, x = ~R_trans, y = ~F_trans, z = ~M_trans, 
        color = ~segment_descriptions[segment], colors = "Set3") |>
  add_markers(marker = list(size = marker_size)) |>
  layout(scene = list(xaxis = list(title = "R_trans"),
                      yaxis = list(title = "F_trans"),
                      zaxis = list(title = "M_trans")),
         title = "3D Scatterplot for RFM_score")

5.5 Discussion

The segments of RFM_score_t and RFM_score are predefined, meaning that the actual underlying structure in the data might not be captured effectively. Also, the approach is rigid as it relies on specific rules for segmentation. Any changes in customer behavior or market dynamics might necessitate a change in these rules. Last, unlike probabilistic models, such classification approach does not provide any measure of uncertainty associated with the segment assignments. Given such disadvantages, we aim to redo such a problem from an unsupervised learning approach in Statistical Modeling IV. We will mainly take a look at 2 approaches: K-means Clustering, used heavily in data mining customer segmentation literature in marketing science, and Gaussian Mixture Models (GMM), a potentially new direction.

6 Reference

[1] Z. Jiang, “RPubs - Exploratory Data Analysis III - Redemptions and Revenue,” rpubs.com, May 13, 2023. https://rpubs.com/jiangzm/1041691

[2] Z. Jiang, “RPubs - Exploratory Data Analysis IV - Further Revision on Redemptions and Revenue,” rpubs.com, May 17, 2023. https://rpubs.com/jiangzm/1044338

[3] Z. Jiang, “RPubs - Statistical Modeling II - Visualization of Causal Panel Analysis on inKind Data,” rpubs.com, Aug. 04, 2023. https://rpubs.com/jiangzm/1069499

[4] A. Hebbali, “RFM - Customer Level Data,” rfm.rsquaredacademy.com, May 25, 2023. https://rfm.rsquaredacademy.com/articles/rfm-customer-level-data.html