1.0 Determine the client business needs
This project used the Single Discrete variable customer segmentation technique to achieve its objective. More specifically, the project used according to Aravind, H. (2023) the RFM (recency, frequency, monetary) analysis. This is a behavioural based technique used to segment customers by examining their transaction history as to:
- how recently a customer has purchased (recency)
- how often they purchase (frequency)
- how much the customer spends (monetary)
Essentially, the methodological framework is based off the pareto principle that: 80 percent of your business comes from 20 percent of your customers. The RFM analysis combines the above three customer attributes to rank customers i.e., If they bought in the recent past, they get higher points. If they bought many times, they get higher scores. And if they spent bigger, they get more points. Combined together, these three scores create the Recency-Frequency-Monetary (RFM) score.
Finally, I segmented the customer database into different groups based on this RFM score.
1.1 Project Objective
This project intended:
To strategically assist an E-Commence enterprise in identifying and characterising a select group of high-value customers (VIPs), ensuring that this chosen cluster of possesses substantial economic potential for targeted marketing efforts.
1.2 Project Questions
Therefore, the project assignment was guided by the following questions:
- Which group of customers make up the high-value customer (VIPs) segment?
- What characteristics are common amongst the above select group of high-value customers?
1.3 Implementation Process
- Determine the client business needs
- Data Sourcing, Cleaning & Exploration
- Feature Creation
- Feature Selection
- Apply RFM Customer Segment Technique
- Analyse Results & Characterize Segment
1.4 About the Data
The Online retail data set was obtained from UCI Machine Learning Repository. The data set contains transaction data occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retail store. The company mainly sells unique all-occasion gifts. NB: Many customers of the company are wholesalers.
2.0 Data Sourcing, Cleaning & Exploration
2.1 Data Sourcing
Import all Necessary Libraries
Installed and loaded all necessary packages for this project. For Data cleaning and exploration, tidyverse was used. GGplot, treemapify and naniar were used for visualization of, general plots, 3D Matrix plots and missing values respectively. Knitr and PrettyDoc was used for R-Markdown templating.
Set Working Directory
Initial Data Exploration
An initial exploration of the columns of the data set, type of the data object under each of the columns was done. This is required to understand if the columns and the data under each of the columns meets the overall data set features and variable types.
kbl(head(df, 10),
caption = "Table 1 showing 1st 10 rows of the Original df Dataset",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12) %>%
column_spec(9, background = "lightblue")| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | X |
|---|---|---|---|---|---|---|---|---|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 29-Nov-16 | 2.55 | 17850 | United Kingdom | NA |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 29-Nov-16 | 3.39 | 17850 | United Kingdom | NA |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 29-Nov-16 | 2.75 | 17850 | United Kingdom | NA |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 29-Nov-16 | 3.39 | 17850 | United Kingdom | NA |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 29-Nov-16 | 3.39 | 17850 | United Kingdom | NA |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 29-Nov-16 | 7.65 | 17850 | United Kingdom | NA |
| 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 29-Nov-16 | 4.25 | 17850 | United Kingdom | NA |
| 536366 | 22633 | HAND WARMER UNION JACK | 6 | 29-Nov-16 | 1.85 | 17850 | United Kingdom | NA |
| 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 29-Nov-16 | 1.85 | 17850 | United Kingdom | NA |
| 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 29-Nov-16 | 1.69 | 13047 | United Kingdom | NA |
## 'data.frame': 541909 obs. of 9 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: chr "29-Nov-16" "29-Nov-16" "29-Nov-16" "29-Nov-16" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## $ X : logi NA NA NA NA NA NA ...
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country" "X"
From the initial exploration of the data set table (1) above, you do realize that there exists one extra column named, X that only contains NA values highlighted in light blue. Cross-checking with the original data set requirements above, this is not supposed to be part of the data set and thus was dropped the column from the data set. We also changed the InvoiceDate column to data type format, columns; InvoiceNo, StockCode, Description, CustomerID & Country
We do make a copy of the original imported data set. We shall subsequently make all changes to this duplicate copy. This is a precautionary practice to keep an original that we can revert back to in case something goes wrong with the duplicate copy.
dfCopy <-copy(df)
# Remove the last undesired column
dfCopy <- dplyr::select(dfCopy, -c(9))
# Create a vector object of all columns supposed to be converted to Categorical Data Type
categoricalColumns <- c('InvoiceNo', 'StockCode', 'Description', 'CustomerID', 'Country')
dfCopy[categoricalColumns] <- lapply(dfCopy[categoricalColumns], as.factor)
dfCopy$InvoiceDate <- as.Date(dfCopy$InvoiceDate, "%d-%b-%y")
kbl(head(dfCopy, 10),
caption = "Table 2 showing 1st 10 rows of the Modified dfCopy Dataset",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2016-11-29 | 2.55 | 17850 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2016-11-29 | 2.75 | 17850 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2016-11-29 | 7.65 | 17850 | United Kingdom |
| 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2016-11-29 | 4.25 | 17850 | United Kingdom |
| 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2016-11-29 | 1.85 | 17850 | United Kingdom |
| 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2016-11-29 | 1.85 | 17850 | United Kingdom |
| 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2016-11-29 | 1.69 | 13047 | United Kingdom |
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : Factor w/ 25900 levels "536365","536366",..: 1 1 1 1 1 1 1 2 2 3 ...
## $ StockCode : Factor w/ 4070 levels "10002","10080",..: 3538 2795 3045 2986 2985 1663 801 1548 1547 3306 ...
## $ Description: Factor w/ 4224 levels "Dotcomgiftshop Gift Voucher \xa320.00",..: 4025 4033 936 1957 2978 3233 1571 1696 1693 263 ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: Date, format: "2016-11-29" "2016-11-29" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : Factor w/ 4372 levels "12346","12347",..: 4049 4049 4049 4049 4049 4049 4049 4049 4049 541 ...
## $ Country : Factor w/ 38 levels "Australia","Austria",..: 36 36 36 36 36 36 36 36 36 36 ...
We further transformed the following columns InvoiceNo, StockCode, Description, CustomerID, Country to categorical variables and the InvoiceDate into a datetime variable. See table 2 above for context.
2.2 Data Cleaning
Check and Handle Missing Values
The dfCopy data frame has to be further checked for missing values especially NA values in each of the columns. The aim to to remove any NA values under the primary key columns i.e., CustomerID that can be used to track the transaction data of each of customer across the time. Since all columns have zero NA values and only the CUstomerID column has 135,080 missing values. This accounts for 25% of the entire data frame. Since CustomerID column contains a unique Customer ID that identifies a customer (Primary Key), it is illogical to impute the missing values in some way and thus remove all rows with missing values.
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 0 0 0 0
## CustomerID Country
## 135080 0
Check and Handle Duplicates, Illogical values
SumOfDuplicates <- sum(duplicated(dfCopy))
count_starting_with_C <- sum(startsWith(as.character(dfCopy$InvoiceNo), 'C'))
cleanedDf <- dfCopy %>%
mutate(InvNo_len = nchar(as.character(InvoiceNo))) %>%
dplyr::filter(InvNo_len <= 6) %>%
dplyr::distinct(.keep_all = TRUE)Check for duplicate rows and remove them. It was discovered that there are 5226 duplicate rows and these were removed. We further removed 8905 values under the Invoice ID column that started with letter ‘c’. This is because from the original data set these values meant a cancellation happened thus a transaction was not completed. During the exploration we discovered that there were illogical values under each of the main columns like the Unit Price and Quantity columns. Furthermore, it was discovered that all unit prices and quantities under the unit price or quantity columns did not make logical sense for this particular assignment since they were negatives.
Before, this stage of the preliminary analysis, the data frame was composed of 406829observations. After, the cleaning processes above, we remained with a data frame composed of 392731 observations (See table 3 below). We further crosschecked the cleaned data frame for duplicates and illogical values below and zero sums were returned.
count_start_with_C <- sum(startsWith(as.character(cleanedDf$InvoiceNo), 'C'))
SumOfDups_After <- sum(duplicated(cleanedDf))
kbl(head(cleanedDf, 10),
caption = "Table 3 showing 1st 10 rows of the Modified cleanedDf Dataset",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvNo_len |
|---|---|---|---|---|---|---|---|---|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2016-11-29 | 2.55 | 17850 | United Kingdom | 6 |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom | 6 |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2016-11-29 | 2.75 | 17850 | United Kingdom | 6 |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom | 6 |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2016-11-29 | 3.39 | 17850 | United Kingdom | 6 |
| 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2016-11-29 | 7.65 | 17850 | United Kingdom | 6 |
| 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2016-11-29 | 4.25 | 17850 | United Kingdom | 6 |
| 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2016-11-29 | 1.85 | 17850 | United Kingdom | 6 |
| 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2016-11-29 | 1.85 | 17850 | United Kingdom | 6 |
| 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2016-11-29 | 1.69 | 13047 | United Kingdom | 6 |
2.3 Data Exploration
Conducted quick summary of the descriptive statistics on each of the cleanedDf data object to get a better understanding of the structure of our data under each of the columns.
From my summary I learned particularly for column unit price that had zero values. Under the Country column, I did also notice that there are country names presented as abbreviations. Under the country column, 38 different unique countries were presented and of these three; USA, RSA & EIRE were country names presented as abbreviations. One column under the same column was labelled Unspecified.
Since the percentage of rows that were labelled Unspecified accounted for less than 1% of the total data, we took a deliberate decision to remove all rows labelled, Unspecified. About the country abbreviations, I undertook some research and established that EIRE matches to the Irish Gaelic name for Ireland, RSA matches to South Africa and USA matches to United States of America. We replaced the mentioned country abbreviations as is below.
Additionally, from the summary statistics in table 4 below we noticed that there were zero values under the Unit Price column. We decided to remove these since they could be representing order returns or errors I decided to remove all rows with zero values under the unit price column.
summary(cleanedDf) %>%
kbl(caption = "Table 4 showing summary statistics for cleanedDf") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvNo_len | |
|---|---|---|---|---|---|---|---|---|---|
| 576339 : 542 | 85123A : 2023 | WHITE HANGING HEART T-LIGHT HOLDER: 2016 | Min. : 1.00 | Min. :2016-11-29 | Min. : 0.000 | 17841 : 7676 | United Kingdom:349226 | Min. :6 | |
| 579196 : 533 | 22423 : 1714 | REGENCY CAKESTAND 3 TIER : 1714 | 1st Qu.: 2.00 | 1st Qu.:2017-04-05 | 1st Qu.: 1.250 | 14911 : 5672 | Germany : 9027 | 1st Qu.:6 | |
| 580727 : 529 | 85099B : 1615 | JUMBO BAG RED RETROSPOT : 1615 | Median : 6.00 | Median :2017-07-29 | Median : 1.950 | 14096 : 5111 | France : 8327 | Median :6 | |
| 578270 : 442 | 84879 : 1395 | ASSORTED COLOUR BIRD ORNAMENT : 1395 | Mean : 13.15 | Mean :2017-07-08 | Mean : 3.126 | 12748 : 4413 | EIRE : 7228 | Mean :6 | |
| 573576 : 435 | 47566 : 1390 | PARTY BUNTING : 1390 | 3rd Qu.: 12.00 | 3rd Qu.:2017-10-18 | 3rd Qu.: 3.750 | 14606 : 2677 | Spain : 2480 | 3rd Qu.:6 | |
| 567656 : 421 | 20725 : 1304 | LUNCH BAG RED RETROSPOT : 1303 | Max. :80995.00 | Max. :2017-12-07 | Max. :8142.750 | 15311 : 2366 | Netherlands : 2363 | Max. :6 | |
| (Other):389829 | (Other):383290 | (Other) :383298 | NA | NA | NA | (Other):364816 | (Other) : 14080 | NA |
cleanedDf %>%
group_by(Country) %>%
tally() %>%
ungroup() %>%
arrange(desc(n)) %>%
kbl(caption = "Table 5 Showing Customer Count Aggregated by Country") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12) %>%
scroll_box(width = "100%", height = "300px")| Country | n |
|---|---|
| United Kingdom | 349226 |
| Germany | 9027 |
| France | 8327 |
| EIRE | 7228 |
| Spain | 2480 |
| Netherlands | 2363 |
| Belgium | 2031 |
| Switzerland | 1842 |
| Portugal | 1453 |
| Australia | 1184 |
| Norway | 1072 |
| Italy | 758 |
| Channel Islands | 747 |
| Finland | 685 |
| Cyprus | 603 |
| Sweden | 450 |
| Austria | 398 |
| Denmark | 380 |
| Poland | 330 |
| Japan | 321 |
| Israel | 245 |
| Unspecified | 241 |
| Singapore | 222 |
| Iceland | 182 |
| USA | 179 |
| Canada | 151 |
| Greece | 145 |
| Malta | 112 |
| United Arab Emirates | 68 |
| European Community | 60 |
| RSA | 58 |
| Lebanon | 45 |
| Lithuania | 35 |
| Brazil | 32 |
| Czech Republic | 25 |
| Bahrain | 17 |
| Saudi Arabia | 9 |
# Replace all rows in Country column that meets criteria in rep_str
rep_str <- c('EIRE'='Ireland', 'USA'='United States of America', 'RSA'='South Africa')
cleanedDf <- cleanedDf %>%
mutate(Country = stringr::str_replace_all(Country, rep_str)) %>%
filter(Country != 'Unspecified', UnitPrice != 0.000) %>%
mutate(Country = fct_relevel(Country),
Revenue = Quantity * UnitPrice,
Recency_days = as.Date("2017-12-08") - InvoiceDate) %>%
dplyr::select(-c(2,3,9))3.0 Feature Creation
From the above code pipeline, we further created a Revenue column to support us during the scoring process for the Monetary analysis i.e., how much each of the customers transacted in the given time frame. Calculation for the Revenue was done by multiplying the unit price and quantity columns for each of the rows in the data object. Furthermore, we created the Time_Dist column to support us during the weighting process for recency analysis.
4.0 Feature Selection
Borrowing from the methodological framework of RFM Analysis, only kept the following columns; Invoice No, Quantity, Invoice Date, Unit Price, CustomerID, Country, Time_Dist and Revenue.
In order to create the RFM data set to be used for further analysis, we grouped the data set by unique customer id and created three additional columns namely; Total Revenue, Recency Days and Total Orders. The Total Revenue column was calculated from the sum of all revenues by a given customer, Recency days column is calculated from the most recent date of transaction purchase date (minimum value under Time_Dist column) and lastly the Total Orders column is calculated from the total number of transaction orders done by a particular customer i.e., Row count under each of the customer group. The transformed RFM data set is shown below under table 6.
df_rfm <- cleanedDf %>%
group_by(CustomerID) %>%
summarise(
transaction_count = n(),
amount = sum(Revenue),
Recency_days = min(Recency_days)
) %>%
ungroup()
kbl(head(df_rfm, 10),
caption = "Table 6 showing Modified Dataset for RFM Analysis",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| CustomerID | transaction_count | amount | Recency_days |
|---|---|---|---|
| 12346 | 1 | 77183.60 | 326 days |
| 12347 | 182 | 4310.00 | 3 days |
| 12348 | 31 | 1797.24 | 76 days |
| 12349 | 73 | 1757.55 | 19 days |
| 12350 | 17 | 334.40 | 311 days |
| 12352 | 85 | 2506.04 | 37 days |
| 12353 | 4 | 89.00 | 205 days |
| 12354 | 58 | 1079.40 | 233 days |
| 12355 | 13 | 459.40 | 215 days |
| 12356 | 59 | 2811.43 | 23 days |
4.1 Further Data Exploration
We further explored the distribution of the of the above transaction count, amount and Recency Days columns and they were all highly skewed to the left.
df_rfm %>%
ggplot(aes(x = transaction_count)) +
geom_density(color="darkblue", fill="lightblue") +
theme_minimal() +
labs(subtitle = "Frequency Distribution Plot for Transaction Count",
x = "Number of Transactions",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
theme(
panel.grid = element_blank(),
panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
text = element_text(size = 10),
axis.text.x = element_text(hjust = 0.6, size = 10),
axis.text.y = element_text(hjust = 0.6, size = 10))df_rfm %>%
ggplot(aes(x = amount)) +
geom_density(color="darkblue", fill="lightblue") +
theme_minimal() +
labs(subtitle = "Frequency Distribution Plot for Total Revenue",
x = "Total Revenue",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
theme(
panel.grid = element_blank(),
panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
text = element_text(size = 10),
axis.text.x = element_text(hjust = 0.6, size = 10),
axis.text.y = element_text(hjust = 0.6, size = 10))df_rfm %>%
ggplot(aes(x = Recency_days)) +
geom_density(color="darkblue", fill="lightblue") +
theme_minimal() +
labs(subtitle = "Frequency Distribution Plot for Minimum Recency Days",
x = "Minimum Last Date of Purchase",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
theme(
panel.grid = element_blank(),
panel.grid.major = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
text = element_text(size = 10),
axis.text.x = element_text(hjust = 0.6, size = 10),
axis.text.y = element_text(hjust = 0.6, size = 10))## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
5.0 Apply RFM Customer Segment Technique
analysis_date = as.Date("2017-12-08")
tmp_table <- rfm_table_customer(data = df_rfm,
customer_id = CustomerID,
n_transactions = transaction_count,
recency_days = Recency_days,
total_revenue = amount,
analysis_date = analysis_date)
rfm_table <- as.data.frame(tmp_table$rfm)Based on the RFM Score created for each of the customers under the rfm_table, we further segmented the customers into 11 distinctive segments. Segments were based on the table outlined in this article. In our context all Customers labelled ‘Champions’ were our VIP group that we need to identify and further characterize.
segment_names <- c("VIPs", "Loyal Customers", "Potential Loyalist",
"New Customers", "Promising", "Need Attention", "About To Sleep",
"At Risk", "Can't Lose Them", "Lost")
recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
rfm_tab2 <- rfm_segment(tmp_table, segment_names, recency_lower, recency_upper,
frequency_lower, frequency_upper, monetary_lower, monetary_upper)
kbl(head(rfm_tab2, 10),
caption = "Table 7 showing Customer Segmentation based on RFM Scores",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| customer_id | segment | rfm_score | transaction_count | recency_days | amount | recency_score | frequency_score | monetary_score |
|---|---|---|---|---|---|---|---|---|
| 12346 | Others | 115 | 1 | 326 days | 77183.60 | 1 | 1 | 5 |
| 12347 | VIPs | 555 | 182 | 3 days | 4310.00 | 5 | 5 | 5 |
| 12348 | Loyal Customers | 234 | 31 | 76 days | 1797.24 | 2 | 3 | 4 |
| 12349 | VIPs | 444 | 73 | 19 days | 1757.55 | 4 | 4 | 4 |
| 12350 | At Risk | 122 | 17 | 311 days | 334.40 | 1 | 2 | 2 |
| 12352 | Loyal Customers | 345 | 85 | 37 days | 2506.04 | 3 | 4 | 5 |
| 12353 | Lost | 111 | 4 | 205 days | 89.00 | 1 | 1 | 1 |
| 12354 | At Risk | 144 | 58 | 233 days | 1079.40 | 1 | 4 | 4 |
| 12355 | Lost | 112 | 13 | 215 days | 459.40 | 1 | 1 | 2 |
| 12356 | VIPs | 445 | 59 | 23 days | 2811.43 | 4 | 4 | 5 |
6.0 Analyse Results & Characterize Segment
6.1 Analysis of Results
Now that we have defined and segmented our customers, we examined the distribution of customers across the segments using a Tree Map. This provided clarity on the VIP segment of this E-commence business.
rfm_tab2 %>%
group_by(segment) %>%
summarise(
Count = n()
) %>%
ungroup() %>%
ggplot(aes(area = Count, fill = Count, label = paste(segment, Count, sep = "\n"))) +
geom_treemap() +
geom_treemap_text(place = "centre", size = 10, colour = "white") +
labs(subtitle = "Customer Segmentation based on RFM Scores",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}"))From the visualization above, it is evident that high-value (VIPs) customers take up 21 percent of the overall customer segment. Most customers of this E-commence business are classified as loyal customers taking up 27 percent. Meaning they spend good money and respond to promotions well.
We further a profiled the customer segments to understand the distribution of their recency, purchase frequency and revenue contribution during the period.
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
From the bar graph above it is clear that VIPs had purchased from the E-Commence store recently while the Lost Customer segment had taken a long time since they did purchase from the E-Commence store.
Moving right to the bar graph visualizing the median frequency by segment. Again, it is noticable that the VIPs made the highest purchases recurrently but also contributed the most to the revenue of the E-commence business (Ref to Median Monetary Value by Segment graph below)
6.2 Characterization of VIP Customer Segment
Ran some summary descriptive statistics on the filtered VIP customer segment and:
- The minimum number of transactions for all customers under this
segment was 58 transactions totaling to a minimum of
935 pounds during this reporting period.
- On average each VIP customer overall spent 2,668 pounds and a maximum of 280,206 pounds.
rfm_tab2 %>% filter(segment == "VIPs") %>% summary() %>%
kbl(caption = "Table 8 showing summary statistics for rfm_table") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| customer_id | segment | rfm_score | transaction_count | recency_days | amount | recency_score | frequency_score | monetary_score | |
|---|---|---|---|---|---|---|---|---|---|
| 12347 : 1 | Length:918 | Min. :444.0 | Min. : 58.00 | Length:918 | Min. : 935 | Min. :4.000 | Min. :4.000 | Min. :4.000 | |
| 12349 : 1 | Class :character | 1st Qu.:455.0 | 1st Qu.: 97.25 | Class :difftime | 1st Qu.: 1682 | 1st Qu.:4.000 | 1st Qu.:4.000 | 1st Qu.:4.000 | |
| 12356 : 1 | Mode :character | Median :544.0 | Median : 162.00 | Mode :numeric | Median : 2668 | Median :5.000 | Median :5.000 | Median :5.000 | |
| 12362 : 1 | NA | Mean :509.2 | Mean : 256.59 | NA | Mean : 6112 | Mean :4.582 | Mean :4.636 | Mean :4.642 | |
| 12364 : 1 | NA | 3rd Qu.:555.0 | 3rd Qu.: 281.50 | NA | 3rd Qu.: 4808 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | |
| 12380 : 1 | NA | Max. :555.0 | Max. :7676.00 | NA | Max. :280206 | Max. :5.000 | Max. :5.000 | Max. :5.000 | |
| (Other):912 | NA | NA | NA | NA | NA | NA | NA | NA |
vip_table <- rfm_tab2 %>%
filter(segment == "VIPs") %>%
inner_join(dfCopy, by = c("customer_id" = "CustomerID")) %>%
mutate(recency_days = as.numeric(recency_days),
Revenue = Quantity * UnitPrice,
Country = stringr::str_replace_all(Country, rep_str),
Country = fct_relevel(Country),
Month = lubridate::month(InvoiceDate, label=TRUE, abbr=FALSE),
Year = year(InvoiceDate)) %>%
select(-c(3,4,5,6,7,8,9))
kbl(head(vip_table, 10),
caption = "Table 9 showing VIP Segment Customers",
align = "c"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| customer_id | segment | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | Country | Revenue | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12347 | VIPs | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 12 | 2016-12-05 | 2.10 | Iceland | 25.2 | December | 2016 |
| 12347 | VIPs | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROWN | 4 | 2016-12-05 | 4.25 | Iceland | 17.0 | December | 2016 |
| 12347 | VIPs | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | 12 | 2016-12-05 | 3.25 | Iceland | 39.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22492 | MINI PAINT SET VINTAGE | 36 | 2016-12-05 | 0.65 | Iceland | 23.4 | December | 2016 |
| 12347 | VIPs | 537626 | 22771 | CLEAR DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22772 | PINK DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22773 | GREEN DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22774 | RED DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22775 | PURPLE DRAWERKNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
| 12347 | VIPs | 537626 | 22805 | BLUE DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2016-12-05 | 1.25 | Iceland | 15.0 | December | 2016 |
summary(vip_table) %>%
kbl(caption = "Table 10 showing summary statistics for VIP Customer table") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
position = "center", font_size = 12)| customer_id | segment | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | Country | Revenue | Month | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17841 : 7983 | Length:243842 | 576339 : 542 | 85123A : 1171 | WHITE HANGING HEART T-LIGHT HOLDER: 1165 | Min. :-9360.00 | Min. :2016-11-29 | Min. : 0.00 | United Kingdom:213984 | Min. :-3825.36 | November :46587 | Min. :2016 | |
| 14911 : 5903 | Class :character | 579196 : 533 | 85099B : 1124 | JUMBO BAG RED RETROSPOT : 1124 | 1st Qu.: 2.00 | 1st Qu.:2017-04-12 | 1st Qu.: 1.25 | Ireland : 7323 | 1st Qu.: 4.20 | October :27860 | 1st Qu.:2017 | |
| 14096 : 5128 | Mode :character | 580727 : 529 | 22423 : 1093 | REGENCY CAKESTAND 3 TIER : 1093 | Median : 5.00 | Median :2017-08-05 | Median : 1.95 | Germany : 6807 | Median : 11.40 | September:23777 | Median :2017 | |
| 12748 : 4642 | NA | 578270 : 442 | 20725 : 978 | LUNCH BAG RED RETROSPOT : 977 | Mean : 12.88 | Mean :2017-07-13 | Mean : 3.16 | France : 5733 | Mean : 22.38 | December :23742 | Mean :2017 | |
| 14606 : 2782 | NA | 573576 : 435 | 20727 : 804 | LUNCH BAG BLACK SKULL. : 804 | 3rd Qu.: 12.00 | 3rd Qu.:2017-10-25 | 3rd Qu.: 3.75 | Netherlands : 2152 | 3rd Qu.: 19.80 | May :18238 | 3rd Qu.:2017 | |
| 15311 : 2491 | NA | 567656 : 421 | 47566 : 793 | PARTY BUNTING : 793 | Max. : 4800.00 | Max. :2017-12-07 | Max. :3155.95 | Spain : 1627 | Max. : 7144.72 | August :18018 | Max. :2017 | |
| (Other):214913 | NA | (Other):240940 | (Other):237879 | (Other) :237886 | NA | NA | NA | (Other) : 6216 | NA | (Other) :85620 | NA |
- We further investigated which countries most of the high-value customers came from. Most of the high-value Customers were from United Kingdom, followed by Ireland and German in this order. Get more information from the visualization below.
vip_table %>%
group_by(Country) %>%
summarise(
Count = n()
) %>%
ungroup() %>%
ggplot(aes(x = Count, y = reorder(Country, Count))) +
geom_col(width = 0.7, fill = "lightblue") +
theme_minimal() +
labs(subtitle = "VIP Customers Disaggregated by Country",
x = "Count",
y = "Country",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
theme(
panel.grid = element_blank(),
panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
text = element_text(size = 10),
axis.text.x = element_text(hjust = 0.6, size = 10),
axis.text.y = element_text(hjust = 0.6, size = 10))- White Hanging Heart T-Light Holder, Jumbo Bag Red RetroSpot and Regency Cakestand 3 Tie were the most popularly purchased product items by this VIP customer segment.
Lastly, since this is a gift shop sort of E-commence business, it would be nice to know what months or season of the year this high-value customer segment makes the most purchases.
vip_table %>%
filter(Year == 2017) %>%
group_by(Month) %>%
summarise(
Total_Revenue = sum(Revenue),
Total_Invoices = length(unique(InvoiceNo))
) %>%
ungroup() %>%
ggplot(aes(x = Total_Invoices, y = Month, size = Total_Revenue, color = Month)) +
geom_point(alpha=0.5) +
scale_size(range = c(.1, 24), name="Total Revenue") +
scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
theme_minimal() +
theme(
legend.position="bottom",
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "lightgrey", linewidth = 0.05, linetype = 1),
text = element_text(size = 10),
axis.text.x = element_text(hjust = 0.6, size = 10),
axis.text.y = element_text(hjust = 0.6, size = 10)) +
labs(subtitle = " Plot showing Sales per Month Aggregated by Revenue",
x = "Total Monthly Sales",
y = "Months of the Year",
caption = stringr::str_glue("Data as of {as.Date('2017-12-08')}")) +
theme(legend.position="none")From the above plot, it is evident that most of the customers clustered under the high-value or VIP segment of this project make the most purchases and generate the most revenue during the last quarter of the year i.e., Months of October, November and December.
7.0 Conclusion
We can use promotional marketing strategies like discounts on commonly purchased items or discounts applied if a certain total invoice sum is surpassed during the last quarter of the year as a nudge to entice this segment of customers to keep around but also entice the loyal customer segment to join the high-value customers. During the other seasons of the year, we can use cues to reminder, to remind this segment of customers of existing exiting deals, products and offers.