Customer Segmentation Project

Block 1: Customer Analytics Project as part of the Ignitus Scholar Assignment

Raymond Lukwago A.R

06-October-2023

1.0 Determine the client business needs

This project used the Single Discrete variable customer segmentation technique to achieve its objective. More specifically, the project used according to Aravind, H. (2023) the RFM (recency, frequency, monetary) analysis. This is a behavioural based technique used to segment customers by examining their transaction history as to:

  1. how recently a customer has purchased (recency)
  2. how often they purchase (frequency)
  3. how much the customer spends (monetary)

Essentially, the methodological framework is based off the pareto principle that: 80 percent of your business comes from 20 percent of your customers. The RFM analysis combines the above three customer attributes to rank customers i.e., If they bought in the recent past, they get higher points. If they bought many times, they get higher scores. And if they spent bigger, they get more points. Combined together, these three scores create the Recency-Frequency-Monetary (RFM) score.

Finally, I segmented the customer database into different groups based on this RFM score.

1.1 Project Objective

This project intended:

To strategically assist an E-Commence enterprise in identifying and characterising a select group of high-value customers (VIPs), ensuring that this chosen cluster of possesses substantial economic potential for targeted marketing efforts.

1.2 Project Questions

Therefore, the project assignment was guided by the following questions:

  1. Which group of customers make up the high-value customer (VIPs) segment?
  2. What characteristics are common amongst the above select group of high-value customers?

1.3 Implementation Process

  1. Determine the client business needs
  2. Data Sourcing, Cleaning & Exploration
  3. Feature Creation
  4. Feature Selection
  5. Apply RFM Customer Segment Technique
  6. Analyse Results & Characterize Segment

1.4 About the Data

The Online retail data set was obtained from UCI Machine Learning Repository. The data set contains transaction data occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retail store. The company mainly sells unique all-occasion gifts. NB: Many customers of the company are wholesalers.

2.0 Data Sourcing, Cleaning & Exploration

2.1 Data Sourcing

Import all Necessary Libraries

Installed and loaded all necessary packages for this project. For Data cleaning and exploration, tidyverse was used. GGplot, treemapify and naniar were used for visualization of, general plots, 3D Matrix plots and missing values respectively. Knitr and PrettyDoc was used for R-Markdown templating.

Set Working Directory

knitr::opts_knit$set(root.dir = "F:/Projects/Ignitus-ML_Projects/0x00-RFM_Segmentation")

Load the Online Retail Data set

df <- read.csv("Ecommerce.csv")

Initial Data Exploration

An initial exploration of the columns of the data set, type of the data object under each of the columns was done. This is required to understand if the columns and the data under each of the columns meets the overall data set features and variable types.

kbl(head(df, 10), 
    caption = "Table 1 showing 1st 10 rows of the Original df Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12) %>%
  column_spec(9, background = "lightblue")
Table 1 showing 1st 10 rows of the Original df Dataset
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country X
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 29-Nov-16 2.55 17850 United Kingdom NA
536365 71053 WHITE METAL LANTERN 6 29-Nov-16 3.39 17850 United Kingdom NA
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 29-Nov-16 2.75 17850 United Kingdom NA
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 29-Nov-16 3.39 17850 United Kingdom NA
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 29-Nov-16 3.39 17850 United Kingdom NA
536365 22752 SET 7 BABUSHKA NESTING BOXES 2 29-Nov-16 7.65 17850 United Kingdom NA
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 29-Nov-16 4.25 17850 United Kingdom NA
536366 22633 HAND WARMER UNION JACK 6 29-Nov-16 1.85 17850 United Kingdom NA
536366 22632 HAND WARMER RED POLKA DOT 6 29-Nov-16 1.85 17850 United Kingdom NA
536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 29-Nov-16 1.69 13047 United Kingdom NA
str(df)
## 'data.frame':    541909 obs. of  9 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: chr  "29-Nov-16" "29-Nov-16" "29-Nov-16" "29-Nov-16" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ X          : logi  NA NA NA NA NA NA ...
colnames(df)
## [1] "InvoiceNo"   "StockCode"   "Description" "Quantity"    "InvoiceDate"
## [6] "UnitPrice"   "CustomerID"  "Country"     "X"

From the initial exploration of the data set table (1) above, you do realize that there exists one extra column named, X that only contains NA values highlighted in light blue. Cross-checking with the original data set requirements above, this is not supposed to be part of the data set and thus was dropped the column from the data set. We also changed the InvoiceDate column to data type format, columns; InvoiceNo, StockCode, Description, CustomerID & Country

We do make a copy of the original imported data set. We shall subsequently make all changes to this duplicate copy. This is a precautionary practice to keep an original that we can revert back to in case something goes wrong with the duplicate copy.

dfCopy <-copy(df)
# Remove the last undesired column
dfCopy <- dplyr::select(dfCopy, -c(9))
# Create a vector object of all columns supposed to be converted to Categorical Data Type
categoricalColumns <- c('InvoiceNo', 'StockCode', 'Description', 'CustomerID', 'Country')
dfCopy[categoricalColumns] <- lapply(dfCopy[categoricalColumns], as.factor)
dfCopy$InvoiceDate <- as.Date(dfCopy$InvoiceDate, "%d-%b-%y")
kbl(head(dfCopy, 10), 
    caption = "Table 2 showing 1st 10 rows of the Modified dfCopy Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 2 showing 1st 10 rows of the Modified dfCopy Dataset
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2016-11-29 2.55 17850 United Kingdom
536365 71053 WHITE METAL LANTERN 6 2016-11-29 3.39 17850 United Kingdom
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2016-11-29 2.75 17850 United Kingdom
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2016-11-29 3.39 17850 United Kingdom
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2016-11-29 3.39 17850 United Kingdom
536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2016-11-29 7.65 17850 United Kingdom
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2016-11-29 4.25 17850 United Kingdom
536366 22633 HAND WARMER UNION JACK 6 2016-11-29 1.85 17850 United Kingdom
536366 22632 HAND WARMER RED POLKA DOT 6 2016-11-29 1.85 17850 United Kingdom
536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 2016-11-29 1.69 13047 United Kingdom
str(dfCopy)
## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ StockCode  : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
##  $ Description: Factor w/ 4224 levels "Dotcomgiftshop Gift Voucher \xa320.00",..: 4025 4033 936 1957 2978 3233 1571 1696 1693 263 ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Date, format: "2016-11-29" "2016-11-29" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : Factor w/ 4372 levels "12346","12347",..: 4049 4049 4049 4049 4049 4049 4049 4049 4049 541 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...

We further transformed the following columns InvoiceNo, StockCode, Description, CustomerID, Country to categorical variables and the InvoiceDate into a datetime variable. See table 2 above for context.

2.2 Data Cleaning

Check and Handle Missing Values

The dfCopy data frame has to be further checked for missing values especially NA values in each of the columns. The aim to to remove any NA values under the primary key columns i.e., CustomerID that can be used to track the transaction data of each of customer across the time. Since all columns have zero NA values and only the CUstomerID column has 135,080 missing values. This accounts for 25% of the entire data frame. Since CustomerID column contains a unique Customer ID that identifies a customer (Primary Key), it is illogical to impute the missing values in some way and thus remove all rows with missing values.

# Check for missing values under each of the data frame columns
colSums(is.na(dfCopy))
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0           0           0           0           0 
##  CustomerID     Country 
##      135080           0
gg_miss_var(dfCopy)

# Drop all rows under CustomerID column with missing values
dfCopy <- dfCopy %>%
  tidyr::drop_na(CustomerID)

Check and Handle Duplicates, Illogical values

SumOfDuplicates <- sum(duplicated(dfCopy))
count_starting_with_C <- sum(startsWith(as.character(dfCopy$InvoiceNo), 'C'))
cleanedDf <- dfCopy %>%
  mutate(InvNo_len = nchar(as.character(InvoiceNo))) %>%
  dplyr::filter(InvNo_len <= 6) %>%
  dplyr::distinct(.keep_all = TRUE)

Check for duplicate rows and remove them. It was discovered that there are 5226 duplicate rows and these were removed. We further removed 8905 values under the Invoice ID column that started with letter ‘c’. This is because from the original data set these values meant a cancellation happened thus a transaction was not completed. During the exploration we discovered that there were illogical values under each of the main columns like the Unit Price and Quantity columns. Furthermore, it was discovered that all unit prices and quantities under the unit price or quantity columns did not make logical sense for this particular assignment since they were negatives.

Before, this stage of the preliminary analysis, the data frame was composed of 406829observations. After, the cleaning processes above, we remained with a data frame composed of 392731 observations (See table 3 below). We further crosschecked the cleaned data frame for duplicates and illogical values below and zero sums were returned.

count_start_with_C <- sum(startsWith(as.character(cleanedDf$InvoiceNo), 'C'))
SumOfDups_After <- sum(duplicated(cleanedDf))
kbl(head(cleanedDf, 10), 
    caption = "Table 3 showing 1st 10 rows of the Modified cleanedDf Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 3 showing 1st 10 rows of the Modified cleanedDf Dataset
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvNo_len
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2016-11-29 2.55 17850 United Kingdom 6
536365 71053 WHITE METAL LANTERN 6 2016-11-29 3.39 17850 United Kingdom 6
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2016-11-29 2.75 17850 United Kingdom 6
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2016-11-29 3.39 17850 United Kingdom 6
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2016-11-29 3.39 17850 United Kingdom 6
536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2016-11-29 7.65 17850 United Kingdom 6
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2016-11-29 4.25 17850 United Kingdom 6
536366 22633 HAND WARMER UNION JACK 6 2016-11-29 1.85 17850 United Kingdom 6
536366 22632 HAND WARMER RED POLKA DOT 6 2016-11-29 1.85 17850 United Kingdom 6
536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 2016-11-29 1.69 13047 United Kingdom 6

2.3 Data Exploration

Conducted quick summary of the descriptive statistics on each of the cleanedDf data object to get a better understanding of the structure of our data under each of the columns.

From my summary I learned particularly for column unit price that had zero values. Under the Country column, I did also notice that there are country names presented as abbreviations. Under the country column, 38 different unique countries were presented and of these three; USA, RSA & EIRE were country names presented as abbreviations. One column under the same column was labelled Unspecified.

Since the percentage of rows that were labelled Unspecified accounted for less than 1% of the total data, we took a deliberate decision to remove all rows labelled, Unspecified. About the country abbreviations, I undertook some research and established that EIRE matches to the Irish Gaelic name for Ireland, RSA matches to South Africa and USA matches to United States of America. We replaced the mentioned country abbreviations as is below.

Additionally, from the summary statistics in table 4 below we noticed that there were zero values under the Unit Price column. We decided to remove these since they could be representing order returns or errors I decided to remove all rows with zero values under the unit price column.

summary(cleanedDf) %>%
  kbl(caption = "Table 4 showing summary statistics for cleanedDf") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 4 showing summary statistics for cleanedDf
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvNo_len
576339 : 542 85123A : 2023 WHITE HANGING HEART T-LIGHT HOLDER: 2016 Min. : 1.00 Min. :2016-11-29 Min. : 0.000 17841 : 7676 United Kingdom:349226 Min. :6
579196 : 533 22423 : 1714 REGENCY CAKESTAND 3 TIER : 1714 1st Qu.: 2.00 1st Qu.:2017-04-05 1st Qu.: 1.250 14911 : 5672 Germany : 9027 1st Qu.:6
580727 : 529 85099B : 1615 JUMBO BAG RED RETROSPOT : 1615 Median : 6.00 Median :2017-07-29 Median : 1.950 14096 : 5111 France : 8327 Median :6
578270 : 442 84879 : 1395 ASSORTED COLOUR BIRD ORNAMENT : 1395 Mean : 13.15 Mean :2017-07-08 Mean : 3.126 12748 : 4413 EIRE : 7228 Mean :6
573576 : 435 47566 : 1390 PARTY BUNTING : 1390 3rd Qu.: 12.00 3rd Qu.:2017-10-18 3rd Qu.: 3.750 14606 : 2677 Spain : 2480 3rd Qu.:6
567656 : 421 20725 : 1304 LUNCH BAG RED RETROSPOT : 1303 Max. :80995.00 Max. :2017-12-07 Max. :8142.750 15311 : 2366 Netherlands : 2363 Max. :6
(Other):389829 (Other):383290 (Other) :383298 NA NA NA (Other):364816 (Other) : 14080 NA
cleanedDf %>%
  group_by(Country) %>%
  tally() %>%
ungroup() %>%
  arrange(desc(n)) %>%
  kbl(caption = "Table 5 Showing Customer Count Aggregated by Country") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12) %>%
  scroll_box(width = "100%", height = "300px")
Table 5 Showing Customer Count Aggregated by Country
Country n
United Kingdom 349226
Germany 9027
France 8327
EIRE 7228
Spain 2480
Netherlands 2363
Belgium 2031
Switzerland 1842
Portugal 1453
Australia 1184
Norway 1072
Italy 758
Channel Islands 747
Finland 685
Cyprus 603
Sweden 450
Austria 398
Denmark 380
Poland 330
Japan 321
Israel 245
Unspecified 241
Singapore 222
Iceland 182
USA 179
Canada 151
Greece 145
Malta 112
United Arab Emirates 68
European Community 60
RSA 58
Lebanon 45
Lithuania 35
Brazil 32
Czech Republic 25
Bahrain 17
Saudi Arabia 9
# Replace all rows in Country column that meets criteria in rep_str
rep_str <- c('EIRE'='Ireland', 'USA'='United States of America', 'RSA'='South Africa')
cleanedDf <- cleanedDf %>%
  mutate(Country = stringr::str_replace_all(Country, rep_str)) %>%
  filter(Country != 'Unspecified', UnitPrice != 0.000) %>%
  mutate(Country = fct_relevel(Country), 
         Revenue = Quantity * UnitPrice,
         Recency_days = as.Date("2017-12-08") - InvoiceDate) %>%
  dplyr::select(-c(2,3,9))

3.0 Feature Creation

From the above code pipeline, we further created a Revenue column to support us during the scoring process for the Monetary analysis i.e., how much each of the customers transacted in the given time frame. Calculation for the Revenue was done by multiplying the unit price and quantity columns for each of the rows in the data object. Furthermore, we created the Time_Dist column to support us during the weighting process for recency analysis.

4.0 Feature Selection

Borrowing from the methodological framework of RFM Analysis, only kept the following columns; Invoice No, Quantity, Invoice Date, Unit Price, CustomerID, Country, Time_Dist and Revenue.

In order to create the RFM data set to be used for further analysis, we grouped the data set by unique customer id and created three additional columns namely; Total Revenue, Recency Days and Total Orders. The Total Revenue column was calculated from the sum of all revenues by a given customer, Recency days column is calculated from the most recent date of transaction purchase date (minimum value under Time_Dist column) and lastly the Total Orders column is calculated from the total number of transaction orders done by a particular customer i.e., Row count under each of the customer group. The transformed RFM data set is shown below under table 6.

df_rfm <- cleanedDf %>%
  group_by(CustomerID) %>%
  summarise(
    transaction_count = n(),
    amount = sum(Revenue),
    Recency_days = min(Recency_days)
  ) %>%
  ungroup()

kbl(head(df_rfm, 10), 
    caption = "Table 6 showing Modified Dataset for RFM Analysis",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 6 showing Modified Dataset for RFM Analysis
CustomerID transaction_count amount Recency_days
12346 1 77183.60 326 days
12347 182 4310.00 3 days
12348 31 1797.24 76 days
12349 73 1757.55 19 days
12350 17 334.40 311 days
12352 85 2506.04 37 days
12353 4 89.00 205 days
12354 58 1079.40 233 days
12355 13 459.40 215 days
12356 59 2811.43 23 days

4.1 Further Data Exploration

We further explored the distribution of the of the above transaction count, amount and Recency Days columns and they were all highly skewed to the left.

df_rfm %>% 
  ggplot(aes(x = transaction_count)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Transaction Count",
       x = "Number of Transactions",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

df_rfm %>% 
  ggplot(aes(x = amount)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Total Revenue",
       x = "Total Revenue",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

df_rfm %>% 
  ggplot(aes(x = Recency_days)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Minimum Recency Days",
       x = "Minimum Last Date of Purchase",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

5.0 Apply RFM Customer Segment Technique

analysis_date = as.Date("2017-12-08")
tmp_table <- rfm_table_customer(data = df_rfm,
                               customer_id = CustomerID,
                               n_transactions = transaction_count,
                               recency_days = Recency_days,
                               total_revenue = amount,
                               analysis_date = analysis_date)
rfm_table <- as.data.frame(tmp_table$rfm)

Based on the RFM Score created for each of the customers under the rfm_table, we further segmented the customers into 11 distinctive segments. Segments were based on the table outlined in this article. In our context all Customers labelled ‘Champions’ were our VIP group that we need to identify and further characterize.

segment_names <- c("VIPs", "Loyal Customers", "Potential Loyalist",
  "New Customers", "Promising", "Need Attention", "About To Sleep",
  "At Risk", "Can't Lose Them", "Lost")

recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

rfm_tab2 <- rfm_segment(tmp_table, segment_names, recency_lower, recency_upper,
frequency_lower, frequency_upper, monetary_lower, monetary_upper)
kbl(head(rfm_tab2, 10),
    caption = "Table 7 showing Customer Segmentation based on RFM Scores",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 7 showing Customer Segmentation based on RFM Scores
customer_id segment rfm_score transaction_count recency_days amount recency_score frequency_score monetary_score
12346 Others 115 1 326 days 77183.60 1 1 5
12347 VIPs 555 182 3 days 4310.00 5 5 5
12348 Loyal Customers 234 31 76 days 1797.24 2 3 4
12349 VIPs 444 73 19 days 1757.55 4 4 4
12350 At Risk 122 17 311 days 334.40 1 2 2
12352 Loyal Customers 345 85 37 days 2506.04 3 4 5
12353 Lost 111 4 205 days 89.00 1 1 1
12354 At Risk 144 58 233 days 1079.40 1 4 4
12355 Lost 112 13 215 days 459.40 1 1 2
12356 VIPs 445 59 23 days 2811.43 4 4 5

6.0 Analyse Results & Characterize Segment

6.1 Analysis of Results

Now that we have defined and segmented our customers, we examined the distribution of customers across the segments using a Tree Map. This provided clarity on the VIP segment of this E-commence business.

rfm_tab2 %>%
  group_by(segment) %>%
  summarise(
    Count = n()
  ) %>%
  ungroup() %>%
  ggplot(aes(area = Count, fill = Count, label = paste(segment, Count, sep = "\n"))) + 
  geom_treemap() +
  geom_treemap_text(place = "centre", size = 10, colour = "white") + 
  labs(subtitle = "Customer Segmentation based on RFM Scores",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}"))

From the visualization above, it is evident that high-value (VIPs) customers take up 21 percent of the overall customer segment. Most customers of this E-commence business are classified as loyal customers taking up 27 percent. Meaning they spend good money and respond to promotions well.

We further a profiled the customer segments to understand the distribution of their recency, purchase frequency and revenue contribution during the period.

median_recency_plot
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

From the bar graph above it is clear that VIPs had purchased from the E-Commence store recently while the Lost Customer segment had taken a long time since they did purchase from the E-Commence store.

median_frequency_plot

Moving right to the bar graph visualizing the median frequency by segment. Again, it is noticable that the VIPs made the highest purchases recurrently but also contributed the most to the revenue of the E-commence business (Ref to Median Monetary Value by Segment graph below)

median_monetary_plot

6.2 Characterization of VIP Customer Segment

Ran some summary descriptive statistics on the filtered VIP customer segment and:

  • The minimum number of transactions for all customers under this segment was 58 transactions totaling to a minimum of 935 pounds during this reporting period.
  • On average each VIP customer overall spent 2,668 pounds and a maximum of 280,206 pounds.
rfm_tab2 %>% filter(segment == "VIPs") %>% summary() %>%
  kbl(caption = "Table 8 showing summary statistics for rfm_table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 8 showing summary statistics for rfm_table
customer_id segment rfm_score transaction_count recency_days amount recency_score frequency_score monetary_score
12347 : 1 Length:918 Min. :444.0 Min. : 58.00 Length:918 Min. : 935 Min. :4.000 Min. :4.000 Min. :4.000
12349 : 1 Class :character 1st Qu.:455.0 1st Qu.: 97.25 Class :difftime 1st Qu.: 1682 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000
12356 : 1 Mode :character Median :544.0 Median : 162.00 Mode :numeric Median : 2668 Median :5.000 Median :5.000 Median :5.000
12362 : 1 NA Mean :509.2 Mean : 256.59 NA Mean : 6112 Mean :4.582 Mean :4.636 Mean :4.642
12364 : 1 NA 3rd Qu.:555.0 3rd Qu.: 281.50 NA 3rd Qu.: 4808 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
12380 : 1 NA Max. :555.0 Max. :7676.00 NA Max. :280206 Max. :5.000 Max. :5.000 Max. :5.000
(Other):912 NA NA NA NA NA NA NA NA
vip_table <- rfm_tab2 %>%
  filter(segment == "VIPs") %>%
  inner_join(dfCopy, by = c("customer_id" = "CustomerID")) %>%
  mutate(recency_days = as.numeric(recency_days),
         Revenue = Quantity * UnitPrice,
         Country = stringr::str_replace_all(Country, rep_str),
         Country = fct_relevel(Country),
         Month = lubridate::month(InvoiceDate, label=TRUE, abbr=FALSE),
         Year = year(InvoiceDate)) %>%
  select(-c(3,4,5,6,7,8,9))

kbl(head(vip_table, 10),
    caption = "Table 9 showing VIP Segment Customers",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 9 showing VIP Segment Customers
customer_id segment InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice Country Revenue Month Year
12347 VIPs 537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 12 2016-12-05 2.10 Iceland 25.2 December 2016
12347 VIPs 537626 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 2016-12-05 4.25 Iceland 17.0 December 2016
12347 VIPs 537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDER 12 2016-12-05 3.25 Iceland 39.0 December 2016
12347 VIPs 537626 22492 MINI PAINT SET VINTAGE 36 2016-12-05 0.65 Iceland 23.4 December 2016
12347 VIPs 537626 22771 CLEAR DRAWER KNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
12347 VIPs 537626 22772 PINK DRAWER KNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
12347 VIPs 537626 22773 GREEN DRAWER KNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
12347 VIPs 537626 22774 RED DRAWER KNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
12347 VIPs 537626 22775 PURPLE DRAWERKNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
12347 VIPs 537626 22805 BLUE DRAWER KNOB ACRYLIC EDWARDIAN 12 2016-12-05 1.25 Iceland 15.0 December 2016
summary(vip_table) %>%
  kbl(caption = "Table 10 showing summary statistics for VIP Customer table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)
Table 10 showing summary statistics for VIP Customer table
customer_id segment InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice Country Revenue Month Year
17841 : 7983 Length:243842 576339 : 542 85123A : 1171 WHITE HANGING HEART T-LIGHT HOLDER: 1165 Min. :-9360.00 Min. :2016-11-29 Min. : 0.00 United Kingdom:213984 Min. :-3825.36 November :46587 Min. :2016
14911 : 5903 Class :character 579196 : 533 85099B : 1124 JUMBO BAG RED RETROSPOT : 1124 1st Qu.: 2.00 1st Qu.:2017-04-12 1st Qu.: 1.25 Ireland : 7323 1st Qu.: 4.20 October :27860 1st Qu.:2017
14096 : 5128 Mode :character 580727 : 529 22423 : 1093 REGENCY CAKESTAND 3 TIER : 1093 Median : 5.00 Median :2017-08-05 Median : 1.95 Germany : 6807 Median : 11.40 September:23777 Median :2017
12748 : 4642 NA 578270 : 442 20725 : 978 LUNCH BAG RED RETROSPOT : 977 Mean : 12.88 Mean :2017-07-13 Mean : 3.16 France : 5733 Mean : 22.38 December :23742 Mean :2017
14606 : 2782 NA 573576 : 435 20727 : 804 LUNCH BAG BLACK SKULL. : 804 3rd Qu.: 12.00 3rd Qu.:2017-10-25 3rd Qu.: 3.75 Netherlands : 2152 3rd Qu.: 19.80 May :18238 3rd Qu.:2017
15311 : 2491 NA 567656 : 421 47566 : 793 PARTY BUNTING : 793 Max. : 4800.00 Max. :2017-12-07 Max. :3155.95 Spain : 1627 Max. : 7144.72 August :18018 Max. :2017
(Other):214913 NA (Other):240940 (Other):237879 (Other) :237886 NA NA NA (Other) : 6216 NA (Other) :85620 NA
  • We further investigated which countries most of the high-value customers came from. Most of the high-value Customers were from United Kingdom, followed by Ireland and German in this order. Get more information from the visualization below.
vip_table %>%
  group_by(Country) %>%
  summarise(
    Count = n()
  ) %>%
  ungroup() %>%
  ggplot(aes(x = Count, y = reorder(Country, Count))) +
           geom_col(width = 0.7, fill = "lightblue") +
           theme_minimal() +
           labs(subtitle = "VIP Customers Disaggregated by Country",
           x = "Count",
           y = "Country",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

  • White Hanging Heart T-Light Holder, Jumbo Bag Red RetroSpot and Regency Cakestand 3 Tie were the most popularly purchased product items by this VIP customer segment.

Lastly, since this is a gift shop sort of E-commence business, it would be nice to know what months or season of the year this high-value customer segment makes the most purchases.

vip_table %>%
  filter(Year == 2017) %>%
  group_by(Month) %>%
  summarise(
    Total_Revenue = sum(Revenue),
    Total_Invoices = length(unique(InvoiceNo))
  ) %>%
  ungroup() %>%
  ggplot(aes(x = Total_Invoices, y = Month, size = Total_Revenue, color = Month)) +
  geom_point(alpha=0.5) +
  scale_size(range = c(.1, 24), name="Total Revenue") +
  scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
  theme_minimal() +
  theme(
    legend.position="bottom",
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10)) +
  labs(subtitle = " Plot showing Sales per Month Aggregated by Revenue",
           x = "Total Monthly Sales",
           y = "Months of the Year",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(legend.position="none")

From the above plot, it is evident that most of the customers clustered under the high-value or VIP segment of this project make the most purchases and generate the most revenue during the last quarter of the year i.e., Months of October, November and December.

7.0 Conclusion

We can use promotional marketing strategies like discounts on commonly purchased items or discounts applied if a certain total invoice sum is surpassed during the last quarter of the year as a nudge to entice this segment of customers to keep around but also entice the loyal customer segment to join the high-value customers. During the other seasons of the year, we can use cues to reminder, to remind this segment of customers of existing exiting deals, products and offers.