Customer Segmentation Project

1.0 Determine the client business needs

This project used the Single Discrete variable customer segmentation technique to achieve its objective. More specifically, the project used according to Aravind, H. (2023) the RFM (recency, frequency, monetary) analysis. This is a behavioural based technique used to segment customers by examining their transaction history as to:

how recently a customer has purchased (recency)
how often they purchase (frequency)
how much the customer spends (monetary)

Essentially, the methodological framework is based off the pareto principle that: 80 percent of your business comes from 20 percent of your customers. The RFM analysis combines the above three customer attributes to rank customers i.e., If they bought in the recent past, they get higher points. If they bought many times, they get higher scores. And if they spent bigger, they get more points. Combined together, these three scores create the Recency-Frequency-Monetary (RFM) score.

Finally, I segmented the customer database into different groups based on this RFM score.

1.1 Project Objective

This project intended:

To strategically assist an E-Commence enterprise in identifying and characterising a select group of high-value customers (VIPs), ensuring that this chosen cluster of possesses substantial economic potential for targeted marketing efforts.

1.2 Project Questions

Therefore, the project assignment was guided by the following questions:

Which group of customers make up the high-value customer (VIPs) segment?
What characteristics are common amongst the above select group of high-value customers?

1.3 Implementation Process

Determine the client business needs
Data Sourcing, Cleaning & Exploration
Feature Creation
Feature Selection
Apply RFM Customer Segment Technique
Analyse Results & Characterize Segment

1.4 About the Data

The Online retail data set was obtained from UCI Machine Learning Repository. The data set contains transaction data occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retail store. The company mainly sells unique all-occasion gifts. NB: Many customers of the company are wholesalers.

2.0 Data Sourcing, Cleaning & Exploration

2.1 Data Sourcing

Import all Necessary Libraries

Installed and loaded all necessary packages for this project. For Data cleaning and exploration, tidyverse was used. GGplot, treemapify and naniar were used for visualization of, general plots, 3D Matrix plots and missing values respectively. Knitr and PrettyDoc was used for R-Markdown templating.

Set Working Directory

knitr::opts_knit$set(root.dir = "F:/Projects/Ignitus-ML_Projects/0x00-RFM_Segmentation")

Load the Online Retail Data set

df <- read.csv("Ecommerce.csv")

Initial Data Exploration

An initial exploration of the columns of the data set, type of the data object under each of the columns was done. This is required to understand if the columns and the data under each of the columns meets the overall data set features and variable types.

kbl(head(df, 10), 
    caption = "Table 1 showing 1st 10 rows of the Original df Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12) %>%
  column_spec(9, background = "lightblue")

Table 1 showing 1st 10 rows of the Original df Dataset
InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	X
536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	29-Nov-16	2.55	17850	United Kingdom	NA
536365	71053	WHITE METAL LANTERN	6	29-Nov-16	3.39	17850	United Kingdom	NA
536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	29-Nov-16	2.75	17850	United Kingdom	NA
536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	29-Nov-16	3.39	17850	United Kingdom	NA
536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	29-Nov-16	3.39	17850	United Kingdom	NA
536365	22752	SET 7 BABUSHKA NESTING BOXES	2	29-Nov-16	7.65	17850	United Kingdom	NA
536365	21730	GLASS STAR FROSTED T-LIGHT HOLDER	6	29-Nov-16	4.25	17850	United Kingdom	NA
536366	22633	HAND WARMER UNION JACK	6	29-Nov-16	1.85	17850	United Kingdom	NA
536366	22632	HAND WARMER RED POLKA DOT	6	29-Nov-16	1.85	17850	United Kingdom	NA
536367	84879	ASSORTED COLOUR BIRD ORNAMENT	32	29-Nov-16	1.69	13047	United Kingdom	NA

str(df)

## 'data.frame':    541909 obs. of  9 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: chr  "29-Nov-16" "29-Nov-16" "29-Nov-16" "29-Nov-16" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ X          : logi  NA NA NA NA NA NA ...

colnames(df)

## [1] "InvoiceNo"   "StockCode"   "Description" "Quantity"    "InvoiceDate"
## [6] "UnitPrice"   "CustomerID"  "Country"     "X"

From the initial exploration of the data set table (1) above, you do realize that there exists one extra column named, X that only contains NA values highlighted in light blue. Cross-checking with the original data set requirements above, this is not supposed to be part of the data set and thus was dropped the column from the data set. We also changed the InvoiceDate column to data type format, columns; InvoiceNo, StockCode, Description, CustomerID & Country

We do make a copy of the original imported data set. We shall subsequently make all changes to this duplicate copy. This is a precautionary practice to keep an original that we can revert back to in case something goes wrong with the duplicate copy.

dfCopy <-copy(df)
# Remove the last undesired column
dfCopy <- dplyr::select(dfCopy, -c(9))
# Create a vector object of all columns supposed to be converted to Categorical Data Type
categoricalColumns <- c('InvoiceNo', 'StockCode', 'Description', 'CustomerID', 'Country')
dfCopy[categoricalColumns] <- lapply(dfCopy[categoricalColumns], as.factor)
dfCopy$InvoiceDate <- as.Date(dfCopy$InvoiceDate, "%d-%b-%y")
kbl(head(dfCopy, 10), 
    caption = "Table 2 showing 1st 10 rows of the Modified dfCopy Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 2 showing 1st 10 rows of the Modified dfCopy Dataset
InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2016-11-29	2.55	17850	United Kingdom
536365	71053	WHITE METAL LANTERN	6	2016-11-29	3.39	17850	United Kingdom
536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2016-11-29	2.75	17850	United Kingdom
536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2016-11-29	3.39	17850	United Kingdom
536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2016-11-29	3.39	17850	United Kingdom
536365	22752	SET 7 BABUSHKA NESTING BOXES	2	2016-11-29	7.65	17850	United Kingdom
536365	21730	GLASS STAR FROSTED T-LIGHT HOLDER	6	2016-11-29	4.25	17850	United Kingdom
536366	22633	HAND WARMER UNION JACK	6	2016-11-29	1.85	17850	United Kingdom
536366	22632	HAND WARMER RED POLKA DOT	6	2016-11-29	1.85	17850	United Kingdom
536367	84879	ASSORTED COLOUR BIRD ORNAMENT	32	2016-11-29	1.69	13047	United Kingdom

str(dfCopy)

## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
##  $ StockCode  : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
##  $ Description: Factor w/ 4224 levels "Dotcomgiftshop Gift Voucher \xa320.00",..: 4025 4033 936 1957 2978 3233 1571 1696 1693 263 ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: Date, format: "2016-11-29" "2016-11-29" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : Factor w/ 4372 levels "12346","12347",..: 4049 4049 4049 4049 4049 4049 4049 4049 4049 541 ...
##  $ Country    : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...

We further transformed the following columns InvoiceNo, StockCode, Description, CustomerID, Country to categorical variables and the InvoiceDate into a datetime variable. See table 2 above for context.

2.2 Data Cleaning

Check and Handle Missing Values

The dfCopy data frame has to be further checked for missing values especially NA values in each of the columns. The aim to to remove any NA values under the primary key columns i.e., CustomerID that can be used to track the transaction data of each of customer across the time. Since all columns have zero NA values and only the CUstomerID column has 135,080 missing values. This accounts for 25% of the entire data frame. Since CustomerID column contains a unique Customer ID that identifies a customer (Primary Key), it is illogical to impute the missing values in some way and thus remove all rows with missing values.

# Check for missing values under each of the data frame columns
colSums(is.na(dfCopy))

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0           0           0           0           0 
##  CustomerID     Country 
##      135080           0

gg_miss_var(dfCopy)

# Drop all rows under CustomerID column with missing values
dfCopy <- dfCopy %>%
  tidyr::drop_na(CustomerID)

Check and Handle Duplicates, Illogical values

SumOfDuplicates <- sum(duplicated(dfCopy))
count_starting_with_C <- sum(startsWith(as.character(dfCopy$InvoiceNo), 'C'))
cleanedDf <- dfCopy %>%
  mutate(InvNo_len = nchar(as.character(InvoiceNo))) %>%
  dplyr::filter(InvNo_len <= 6) %>%
  dplyr::distinct(.keep_all = TRUE)

Check for duplicate rows and remove them. It was discovered that there are 5226 duplicate rows and these were removed. We further removed 8905 values under the Invoice ID column that started with letter ‘c’. This is because from the original data set these values meant a cancellation happened thus a transaction was not completed. During the exploration we discovered that there were illogical values under each of the main columns like the Unit Price and Quantity columns. Furthermore, it was discovered that all unit prices and quantities under the unit price or quantity columns did not make logical sense for this particular assignment since they were negatives.

Before, this stage of the preliminary analysis, the data frame was composed of 406829observations. After, the cleaning processes above, we remained with a data frame composed of 392731 observations (See table 3 below). We further crosschecked the cleaned data frame for duplicates and illogical values below and zero sums were returned.

count_start_with_C <- sum(startsWith(as.character(cleanedDf$InvoiceNo), 'C'))
SumOfDups_After <- sum(duplicated(cleanedDf))
kbl(head(cleanedDf, 10), 
    caption = "Table 3 showing 1st 10 rows of the Modified cleanedDf Dataset",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 3 showing 1st 10 rows of the Modified cleanedDf Dataset
InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	InvNo_len
536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2016-11-29	2.55	17850	United Kingdom	6
536365	71053	WHITE METAL LANTERN	6	2016-11-29	3.39	17850	United Kingdom	6
536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2016-11-29	2.75	17850	United Kingdom	6
536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2016-11-29	3.39	17850	United Kingdom	6
536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2016-11-29	3.39	17850	United Kingdom	6
536365	22752	SET 7 BABUSHKA NESTING BOXES	2	2016-11-29	7.65	17850	United Kingdom	6
536365	21730	GLASS STAR FROSTED T-LIGHT HOLDER	6	2016-11-29	4.25	17850	United Kingdom	6
536366	22633	HAND WARMER UNION JACK	6	2016-11-29	1.85	17850	United Kingdom	6
536366	22632	HAND WARMER RED POLKA DOT	6	2016-11-29	1.85	17850	United Kingdom	6
536367	84879	ASSORTED COLOUR BIRD ORNAMENT	32	2016-11-29	1.69	13047	United Kingdom	6

2.3 Data Exploration

Conducted quick summary of the descriptive statistics on each of the cleanedDf data object to get a better understanding of the structure of our data under each of the columns.

From my summary I learned particularly for column unit price that had zero values. Under the Country column, I did also notice that there are country names presented as abbreviations. Under the country column, 38 different unique countries were presented and of these three; USA, RSA & EIRE were country names presented as abbreviations. One column under the same column was labelled Unspecified.

Since the percentage of rows that were labelled Unspecified accounted for less than 1% of the total data, we took a deliberate decision to remove all rows labelled, Unspecified. About the country abbreviations, I undertook some research and established that EIRE matches to the Irish Gaelic name for Ireland, RSA matches to South Africa and USA matches to United States of America. We replaced the mentioned country abbreviations as is below.

Additionally, from the summary statistics in table 4 below we noticed that there were zero values under the Unit Price column. We decided to remove these since they could be representing order returns or errors I decided to remove all rows with zero values under the unit price column.

summary(cleanedDf) %>%
  kbl(caption = "Table 4 showing summary statistics for cleanedDf") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 4 showing summary statistics for cleanedDf
InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	InvNo_len
576339 : 542	85123A : 2023	WHITE HANGING HEART T-LIGHT HOLDER: 2016	Min. : 1.00	Min. :2016-11-29	Min. : 0.000	17841 : 7676	United Kingdom:349226	Min. :6
579196 : 533	22423 : 1714	REGENCY CAKESTAND 3 TIER : 1714	1st Qu.: 2.00	1st Qu.:2017-04-05	1st Qu.: 1.250	14911 : 5672	Germany : 9027	1st Qu.:6
580727 : 529	85099B : 1615	JUMBO BAG RED RETROSPOT : 1615	Median : 6.00	Median :2017-07-29	Median : 1.950	14096 : 5111	France : 8327	Median :6
578270 : 442	84879 : 1395	ASSORTED COLOUR BIRD ORNAMENT : 1395	Mean : 13.15	Mean :2017-07-08	Mean : 3.126	12748 : 4413	EIRE : 7228	Mean :6
573576 : 435	47566 : 1390	PARTY BUNTING : 1390	3rd Qu.: 12.00	3rd Qu.:2017-10-18	3rd Qu.: 3.750	14606 : 2677	Spain : 2480	3rd Qu.:6
567656 : 421	20725 : 1304	LUNCH BAG RED RETROSPOT : 1303	Max. :80995.00	Max. :2017-12-07	Max. :8142.750	15311 : 2366	Netherlands : 2363	Max. :6
(Other):389829	(Other):383290	(Other) :383298	NA	NA	NA	(Other):364816	(Other) : 14080	NA

cleanedDf %>%
  group_by(Country) %>%
  tally() %>%
ungroup() %>%
  arrange(desc(n)) %>%
  kbl(caption = "Table 5 Showing Customer Count Aggregated by Country") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12) %>%
  scroll_box(width = "100%", height = "300px")

Table 5 Showing Customer Count Aggregated by Country
Country	n
United Kingdom	349226
Germany	9027
France	8327
EIRE	7228
Spain	2480
Netherlands	2363
Belgium	2031
Switzerland	1842
Portugal	1453
Australia	1184
Norway	1072
Italy	758
Channel Islands	747
Finland	685
Cyprus	603
Sweden	450
Austria	398
Denmark	380
Poland	330
Japan	321
Israel	245
Unspecified	241
Singapore	222
Iceland	182
USA	179
Canada	151
Greece	145
Malta	112
United Arab Emirates	68
European Community	60
RSA	58
Lebanon	45
Lithuania	35
Brazil	32
Czech Republic	25
Bahrain	17
Saudi Arabia	9

# Replace all rows in Country column that meets criteria in rep_str
rep_str <- c('EIRE'='Ireland', 'USA'='United States of America', 'RSA'='South Africa')
cleanedDf <- cleanedDf %>%
  mutate(Country = stringr::str_replace_all(Country, rep_str)) %>%
  filter(Country != 'Unspecified', UnitPrice != 0.000) %>%
  mutate(Country = fct_relevel(Country), 
         Revenue = Quantity * UnitPrice,
         Recency_days = as.Date("2017-12-08") - InvoiceDate) %>%
  dplyr::select(-c(2,3,9))

3.0 Feature Creation

From the above code pipeline, we further created a Revenue column to support us during the scoring process for the Monetary analysis i.e., how much each of the customers transacted in the given time frame. Calculation for the Revenue was done by multiplying the unit price and quantity columns for each of the rows in the data object. Furthermore, we created the Time_Dist column to support us during the weighting process for recency analysis.

4.0 Feature Selection

Borrowing from the methodological framework of RFM Analysis, only kept the following columns; Invoice No, Quantity, Invoice Date, Unit Price, CustomerID, Country, Time_Dist and Revenue.

In order to create the RFM data set to be used for further analysis, we grouped the data set by unique customer id and created three additional columns namely; Total Revenue, Recency Days and Total Orders. The Total Revenue column was calculated from the sum of all revenues by a given customer, Recency days column is calculated from the most recent date of transaction purchase date (minimum value under Time_Dist column) and lastly the Total Orders column is calculated from the total number of transaction orders done by a particular customer i.e., Row count under each of the customer group. The transformed RFM data set is shown below under table 6.

df_rfm <- cleanedDf %>%
  group_by(CustomerID) %>%
  summarise(
    transaction_count = n(),
    amount = sum(Revenue),
    Recency_days = min(Recency_days)
  ) %>%
  ungroup()

kbl(head(df_rfm, 10), 
    caption = "Table 6 showing Modified Dataset for RFM Analysis",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 6 showing Modified Dataset for RFM Analysis
CustomerID	transaction_count	amount	Recency_days
12346	1	77183.60	326 days
12347	182	4310.00	3 days
12348	31	1797.24	76 days
12349	73	1757.55	19 days
12350	17	334.40	311 days
12352	85	2506.04	37 days
12353	4	89.00	205 days
12354	58	1079.40	233 days
12355	13	459.40	215 days
12356	59	2811.43	23 days

4.1 Further Data Exploration

We further explored the distribution of the of the above transaction count, amount and Recency Days columns and they were all highly skewed to the left.

df_rfm %>% 
  ggplot(aes(x = transaction_count)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Transaction Count",
       x = "Number of Transactions",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

df_rfm %>% 
  ggplot(aes(x = amount)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Total Revenue",
       x = "Total Revenue",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

df_rfm %>% 
  ggplot(aes(x = Recency_days)) +
  geom_density(color="darkblue", fill="lightblue") +
  theme_minimal() +
  labs(subtitle = "Frequency Distribution Plot for Minimum Recency Days",
       x = "Minimum Last Date of Purchase",
       caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

5.0 Apply RFM Customer Segment Technique

analysis_date = as.Date("2017-12-08")
tmp_table <- rfm_table_customer(data = df_rfm,
                               customer_id = CustomerID,
                               n_transactions = transaction_count,
                               recency_days = Recency_days,
                               total_revenue = amount,
                               analysis_date = analysis_date)
rfm_table <- as.data.frame(tmp_table$rfm)

Based on the RFM Score created for each of the customers under the rfm_table, we further segmented the customers into 11 distinctive segments. Segments were based on the table outlined in this article. In our context all Customers labelled ‘Champions’ were our VIP group that we need to identify and further characterize.

segment_names <- c("VIPs", "Loyal Customers", "Potential Loyalist",
  "New Customers", "Promising", "Need Attention", "About To Sleep",
  "At Risk", "Can't Lose Them", "Lost")

recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

rfm_tab2 <- rfm_segment(tmp_table, segment_names, recency_lower, recency_upper,
frequency_lower, frequency_upper, monetary_lower, monetary_upper)
kbl(head(rfm_tab2, 10),
    caption = "Table 7 showing Customer Segmentation based on RFM Scores",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 7 showing Customer Segmentation based on RFM Scores
customer_id	segment	rfm_score	transaction_count	recency_days	amount	recency_score	frequency_score	monetary_score
12346	Others	115	1	326 days	77183.60	1	1	5
12347	VIPs	555	182	3 days	4310.00	5	5	5
12348	Loyal Customers	234	31	76 days	1797.24	2	3	4
12349	VIPs	444	73	19 days	1757.55	4	4	4
12350	At Risk	122	17	311 days	334.40	1	2	2
12352	Loyal Customers	345	85	37 days	2506.04	3	4	5
12353	Lost	111	4	205 days	89.00	1	1	1
12354	At Risk	144	58	233 days	1079.40	1	4	4
12355	Lost	112	13	215 days	459.40	1	1	2
12356	VIPs	445	59	23 days	2811.43	4	4	5

6.0 Analyse Results & Characterize Segment

6.1 Analysis of Results

Now that we have defined and segmented our customers, we examined the distribution of customers across the segments using a Tree Map. This provided clarity on the VIP segment of this E-commence business.

rfm_tab2 %>%
  group_by(segment) %>%
  summarise(
    Count = n()
  ) %>%
  ungroup() %>%
  ggplot(aes(area = Count, fill = Count, label = paste(segment, Count, sep = "\n"))) + 
  geom_treemap() +
  geom_treemap_text(place = "centre", size = 10, colour = "white") + 
  labs(subtitle = "Customer Segmentation based on RFM Scores",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}"))

From the visualization above, it is evident that high-value (VIPs) customers take up 21 percent of the overall customer segment. Most customers of this E-commence business are classified as loyal customers taking up 27 percent. Meaning they spend good money and respond to promotions well.

We further a profiled the customer segments to understand the distribution of their recency, purchase frequency and revenue contribution during the period.

median_recency_plot

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

From the bar graph above it is clear that VIPs had purchased from the E-Commence store recently while the Lost Customer segment had taken a long time since they did purchase from the E-Commence store.

median_frequency_plot

Moving right to the bar graph visualizing the median frequency by segment. Again, it is noticable that the VIPs made the highest purchases recurrently but also contributed the most to the revenue of the E-commence business (Ref to Median Monetary Value by Segment graph below)

median_monetary_plot

6.2 Characterization of VIP Customer Segment

Ran some summary descriptive statistics on the filtered VIP customer segment and:

The minimum number of transactions for all customers under this segment was 58 transactions totaling to a minimum of 935 pounds during this reporting period.
On average each VIP customer overall spent 2,668 pounds and a maximum of 280,206 pounds.

rfm_tab2 %>% filter(segment == "VIPs") %>% summary() %>%
  kbl(caption = "Table 8 showing summary statistics for rfm_table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 8 showing summary statistics for rfm_table
customer_id	segment	rfm_score	transaction_count	recency_days	amount	recency_score	frequency_score	monetary_score
12347 : 1	Length:918	Min. :444.0	Min. : 58.00	Length:918	Min. : 935	Min. :4.000	Min. :4.000	Min. :4.000
12349 : 1	Class :character	1st Qu.:455.0	1st Qu.: 97.25	Class :difftime	1st Qu.: 1682	1st Qu.:4.000	1st Qu.:4.000	1st Qu.:4.000
12356 : 1	Mode :character	Median :544.0	Median : 162.00	Mode :numeric	Median : 2668	Median :5.000	Median :5.000	Median :5.000
12362 : 1	NA	Mean :509.2	Mean : 256.59	NA	Mean : 6112	Mean :4.582	Mean :4.636	Mean :4.642
12364 : 1	NA	3rd Qu.:555.0	3rd Qu.: 281.50	NA	3rd Qu.: 4808	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.:5.000
12380 : 1	NA	Max. :555.0	Max. :7676.00	NA	Max. :280206	Max. :5.000	Max. :5.000	Max. :5.000
(Other):912	NA	NA	NA	NA	NA	NA	NA	NA

vip_table <- rfm_tab2 %>%
  filter(segment == "VIPs") %>%
  inner_join(dfCopy, by = c("customer_id" = "CustomerID")) %>%
  mutate(recency_days = as.numeric(recency_days),
         Revenue = Quantity * UnitPrice,
         Country = stringr::str_replace_all(Country, rep_str),
         Country = fct_relevel(Country),
         Month = lubridate::month(InvoiceDate, label=TRUE, abbr=FALSE),
         Year = year(InvoiceDate)) %>%
  select(-c(3,4,5,6,7,8,9))

kbl(head(vip_table, 10),
    caption = "Table 9 showing VIP Segment Customers",
    align = "c"
      ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 9 showing VIP Segment Customers
customer_id	segment	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	Country	Revenue	Month	Year
12347	VIPs	537626	85116	BLACK CANDELABRA T-LIGHT HOLDER	12	2016-12-05	2.10	Iceland	25.2	December	2016
12347	VIPs	537626	22375	AIRLINE BAG VINTAGE JET SET BROWN	4	2016-12-05	4.25	Iceland	17.0	December	2016
12347	VIPs	537626	71477	COLOUR GLASS. STAR T-LIGHT HOLDER	12	2016-12-05	3.25	Iceland	39.0	December	2016
12347	VIPs	537626	22492	MINI PAINT SET VINTAGE	36	2016-12-05	0.65	Iceland	23.4	December	2016
12347	VIPs	537626	22771	CLEAR DRAWER KNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016
12347	VIPs	537626	22772	PINK DRAWER KNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016
12347	VIPs	537626	22773	GREEN DRAWER KNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016
12347	VIPs	537626	22774	RED DRAWER KNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016
12347	VIPs	537626	22775	PURPLE DRAWERKNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016
12347	VIPs	537626	22805	BLUE DRAWER KNOB ACRYLIC EDWARDIAN	12	2016-12-05	1.25	Iceland	15.0	December	2016

summary(vip_table) %>%
  kbl(caption = "Table 10 showing summary statistics for VIP Customer table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                position = "center", font_size = 12)

Table 10 showing summary statistics for VIP Customer table
customer_id	segment	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	Country	Revenue	Month	Year
17841 : 7983	Length:243842	576339 : 542	85123A : 1171	WHITE HANGING HEART T-LIGHT HOLDER: 1165	Min. :-9360.00	Min. :2016-11-29	Min. : 0.00	United Kingdom:213984	Min. :-3825.36	November :46587	Min. :2016
14911 : 5903	Class :character	579196 : 533	85099B : 1124	JUMBO BAG RED RETROSPOT : 1124	1st Qu.: 2.00	1st Qu.:2017-04-12	1st Qu.: 1.25	Ireland : 7323	1st Qu.: 4.20	October :27860	1st Qu.:2017
14096 : 5128	Mode :character	580727 : 529	22423 : 1093	REGENCY CAKESTAND 3 TIER : 1093	Median : 5.00	Median :2017-08-05	Median : 1.95	Germany : 6807	Median : 11.40	September:23777	Median :2017
12748 : 4642	NA	578270 : 442	20725 : 978	LUNCH BAG RED RETROSPOT : 977	Mean : 12.88	Mean :2017-07-13	Mean : 3.16	France : 5733	Mean : 22.38	December :23742	Mean :2017
14606 : 2782	NA	573576 : 435	20727 : 804	LUNCH BAG BLACK SKULL. : 804	3rd Qu.: 12.00	3rd Qu.:2017-10-25	3rd Qu.: 3.75	Netherlands : 2152	3rd Qu.: 19.80	May :18238	3rd Qu.:2017
15311 : 2491	NA	567656 : 421	47566 : 793	PARTY BUNTING : 793	Max. : 4800.00	Max. :2017-12-07	Max. :3155.95	Spain : 1627	Max. : 7144.72	August :18018	Max. :2017
(Other):214913	NA	(Other):240940	(Other):237879	(Other) :237886	NA	NA	NA	(Other) : 6216	NA	(Other) :85620	NA

We further investigated which countries most of the high-value customers came from. Most of the high-value Customers were from United Kingdom, followed by Ireland and German in this order. Get more information from the visualization below.

vip_table %>%
  group_by(Country) %>%
  summarise(
    Count = n()
  ) %>%
  ungroup() %>%
  ggplot(aes(x = Count, y = reorder(Country, Count))) +
           geom_col(width = 0.7, fill = "lightblue") +
           theme_minimal() +
           labs(subtitle = "VIP Customers Disaggregated by Country",
           x = "Count",
           y = "Country",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(
    panel.grid = element_blank(),
    panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10))

White Hanging Heart T-Light Holder, Jumbo Bag Red RetroSpot and Regency Cakestand 3 Tie were the most popularly purchased product items by this VIP customer segment.

Lastly, since this is a gift shop sort of E-commence business, it would be nice to know what months or season of the year this high-value customer segment makes the most purchases.

vip_table %>%
  filter(Year == 2017) %>%
  group_by(Month) %>%
  summarise(
    Total_Revenue = sum(Revenue),
    Total_Invoices = length(unique(InvoiceNo))
  ) %>%
  ungroup() %>%
  ggplot(aes(x = Total_Invoices, y = Month, size = Total_Revenue, color = Month)) +
  geom_point(alpha=0.5) +
  scale_size(range = c(.1, 24), name="Total Revenue") +
  scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
  theme_minimal() +
  theme(
    legend.position="bottom",
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
    text = element_text(size = 10),
    axis.text.x = element_text(hjust = 0.6, size = 10),
    axis.text.y = element_text(hjust = 0.6, size = 10)) +
  labs(subtitle = " Plot showing Sales per Month Aggregated by Revenue",
           x = "Total Monthly Sales",
           y = "Months of the Year",
           caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
  theme(legend.position="none")

From the above plot, it is evident that most of the customers clustered under the high-value or VIP segment of this project make the most purchases and generate the most revenue during the last quarter of the year i.e., Months of October, November and December.

7.0 Conclusion

We can use promotional marketing strategies like discounts on commonly purchased items or discounts applied if a certain total invoice sum is surpassed during the last quarter of the year as a nudge to entice this segment of customers to keep around but also entice the loyal customer segment to join the high-value customers. During the other seasons of the year, we can use cues to reminder, to remind this segment of customers of existing exiting deals, products and offers.