1 Introduction

Note: Idea of Cohort Analysis and K-mean Clustering is sourcing from https://www.kaggle.com/mahmoudelfahl/cohort-analysis-customer-segmentation-with-rfm

Data Science is especially relevant in the e-commerce and retail industry. They can predict the purchases, profits, losses and even manipulate customers into buying things by tracking their behaviour. Retail brands analyse data to create customer profiles and learn his/her sore points and market their product accordingly to push the customer towards purchasing. Have you ever imagined what if there is no any recommendation products or products that pop out in your timelime are random? Will you do any further action, either (click) or purchasing?

Recommendation engines are the most important tools in a e-commerce platform.By using these engines, retails on e-commerces can increase their sales. Providing recommendation engine is one of the way how the online marketing does to improve conversion rate. For instance, let’s say there is a online buyer have purchased a smartphone in a e-commerce, there will be recommendation items (such as another brand of smartphone, earphone, etc) since then.

In retail, customers purchase items based on impulse, and we can work on this principle by predicting the chances of a customer’s purchasing behaviour. This mostly involves a lot of how the marketing of the product is done by the retailers, and in the world of e-commerce, customer data is the best place to look for potential buying impulses. Similar to search recommendations, market basket analysis also works with a machine learning or deep learning algorithm. (source: https://www.mygreatlearning.com/blog/applications-of-data-science-in-e-commerce-industry, accessed on April 19th, 2020)

One of measurement to indicate how well an online platform performs is the retention rate. By using Cohort Analysis, we can see the metrics of retention rate which indicate how many customers are still active since their first purchase. Retention rate will show you the percentage of active customers to the total number of customers.

As Marketing Manager, we need to know our customers very well. Knowing how many products they will buy, how long to take customers to repurchase again and how much customers have spent on our website will help us to make a good and specific strategy to maintain customers.

Knowing when behavior and number of purchasing is important too. By analyzing behavior of purchasing time of customers and forecast number of transaction, marketing team can decide when the best time is to do promotion. With do promotion right on target, we have done efficiency in marketing cost.

2 Cohort Analysis (Retention over User & Product Lifetime)

2.1 What is Cohort Analysis

A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves across time and compare it to other cohorts. Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare difference across subjects) at intervals through time.

2.2 Type of Cohorts

  • Time Cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers’ behavior depending on the time they started using the company’s products or services. The time may be monthly or quarterly even daily.
  • Behaovior cohorts are customers who purchased a product or subscribed to a service in the past. It groups customers by the type of product or service they signed up. Customers who signed up for basic level services might have different needs than those who signed up for advanced services. Understaning the needs of the various cohorts can help a company design custom-made services or products for particular segments.
  • Size cohorts refer to the various sizes of customers who purchase company’s products or services. This categorization can be based on the amount of spending in some periodic time after acquisition or the product type that the customer spent most of their order amount in some period of time.

2.3 Exploratory Data Analysis

First of all, we should import data. This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.

This dataset includes information about:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
-InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
-UnitPrice: Unit price. Numeric, Product price per unit in sterling.
-CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
-Country: Country name. Nominal, the name of the country where each customer resides.

2.3.1 Data Wrangling

First step we have to do is checking whether there is NA (Not Available) or missing value in this data.

##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0        1454           0           0           0 
##  CustomerID     Country 
##      135080           0

Since CustomerID is important for our model, we need to remove all the CustomerID with NA value

After finished removing all of missing values, we need to remove all duplicated data that will impact our modeling.

Getting finished in removing data, we move to next step to check summary of data. In this step, we verify whether there is anomaly data or not.

##   InvoiceNo          StockCode         Description           Quantity        
##  Length:401604      Length:401604      Length:401604      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     2.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     5.00  
##                                                           Mean   :    12.18  
##                                                           3rd Qu.:    12.00  
##                                                           Max.   : 80995.00  
##   InvoiceDate                    UnitPrice          CustomerID   
##  Min.   :2010-12-01 08:26:00   Min.   :    0.00   Min.   :12346  
##  1st Qu.:2011-04-06 15:02:00   1st Qu.:    1.25   1st Qu.:13939  
##  Median :2011-07-29 15:40:00   Median :    1.95   Median :15145  
##  Mean   :2011-07-10 12:08:23   Mean   :    3.47   Mean   :15281  
##  3rd Qu.:2011-10-20 11:58:30   3rd Qu.:    3.75   3rd Qu.:16784  
##  Max.   :2011-12-09 12:50:00   Max.   :38970.00   Max.   :18287  
##    Country         
##  Length:401604     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

The min for unit price = 0 and the min for Quantity with negative value. We need to subset data or select data with UnitPrice above ‘0’ and Quantity above ‘0’.

After finishing subsetting, we can re-check again to verify the data.

##   InvoiceNo          StockCode         Description           Quantity       
##  Length:392692      Length:392692      Length:392692      Min.   :    1.00  
##  Class :character   Class :character   Class :character   1st Qu.:    2.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    6.00  
##                                                           Mean   :   13.12  
##                                                           3rd Qu.:   12.00  
##                                                           Max.   :80995.00  
##   InvoiceDate                    UnitPrice          CustomerID   
##  Min.   :2010-12-01 08:26:00   Min.   :   0.001   Min.   :12346  
##  1st Qu.:2011-04-07 11:12:00   1st Qu.:   1.250   1st Qu.:13955  
##  Median :2011-07-31 12:02:00   Median :   1.950   Median :15150  
##  Mean   :2011-07-10 19:13:07   Mean   :   3.126   Mean   :15288  
##  3rd Qu.:2011-10-20 12:53:00   3rd Qu.:   3.750   3rd Qu.:16791  
##  Max.   :2011-12-09 12:50:00   Max.   :8142.750   Max.   :18287  
##    Country         
##  Length:392692     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

As we can see from the summary, the min. UnitPrice and Quantity are above ‘0’. Let’s move to Cohort Analysis step.

2.4 Cohort Analysis Step

For cohort analysis, there are a few labels that we have to create:

  • Invoice period: A string representation of the year and month of a single transaction/invoice.
  • Cohort group: A string representation of the the year and month of a customer’s first purchase. This label is common across all invoices for a particular customer.
  • Cohort period / Cohort Index: A integer representation a customer’s stage in its “lifetime”. The number represents the number of months passed since the first purchase.

2.4.1 Retention Rate Table

After managed to create Invoice Period, Cohort Month, and Cohort Index, we can make retention rate to see the percentage of active users comparing to total users.

After finished creating retention table, we can plot it on heatmap to see more clear picture of this table.

We find:

  • The highest retention rate is at Dec 2010 of Cohort Month

  • The trend shows the retention tending to go down slowly.

2.4.2 Average Quantity of Each Chorts

Beside retention table, we can make average quantity of active customers in each chorts.

We find:

  • the trend of average of quantity is going down.

  • For data at 8th CohortIndex of May 2011, we can conclude it as an outlier due to there is a high quantity purchased with one ID.

3 Seasonality Analysis

Before we do forecast modeling, we need to check what seasonality that our data is.

After convert our data hourly, we can check whether there is any transactions in every hour.

From the generated table, we can see that there are no transaction at 9 PM to 5 AM. Thus, we can takeout the zero transaction hours to create a better model.

3.1 Single Time Series

First step, let’s try with single time series and check whether all seasonalities can be explained / comprehended in this time series.

3.2 Decomposing

We find:

  • Since graphic in trend column haven’t been created smoothly, find out that there is seasonality that haven’t been caught in retail_ts. That means our data are most likely having multiple seasonality.

3.3 Multiple Time Series

Since we conclude that our time series is multiple time series, let’s move to multiple time series to see its seasonality.

As we can see, the trend is created smoothly indicating that our data set has multiple seasonality.

4 Modeling

4.1 Data pre-processing & Simple Modeling

For K-Mean Clustering, we are going to build RFM (Recency, Frequency, Monetary) model first. So, what is RFM?

  • Recency is about when was the last order of a customer. It means the number of days since a customer made the last purchase. If it’s a case for a website or an app, this could be interpreted as the last visit day or the last login time.

  • Frequency is about the number of purchase in a given period. It could be 3 months, 6 months or 1 year. So we can understand this value as for how often or how many a customer used the product of a company. The bigger the value is, the more engaged the customers are. Could we say them as our VIP? Not necessary. Cause we also have to think about how much they actually paid for each purchase, which means monetary value.

  • Monetary is the total amount of money a customer spent in that given period. Therefore big spenders will be differentiated with other customers such as MVP or VIP.

The RFM values can be grouped in several ways:
1.Percentiles e.g. quantiles
2.Pareto 80/20 cut
3.Custom - based on business knowledge

On this project, we are going to group by percentiles

Process of calculating percentiles:
1. Sort customers based on that metric
2. Break customers into a pre-defined number of groups of equal size
3. Assign a label to each group

As for first step, let’s add column to see total amount of purchasing that have been done.

After create Total_Purchase, let’s check first order and last

## [1] "2011-12-09 12:50:00 UTC"
## [1] "2010-12-01 08:26:00 UTC"

Since we want to calculate day periods, let’s set one day after the last one (December 9, 2011) as snapshot_date

## [1] "2011-12-10 12:50:00 UTC"

Now, we have got snapshot_date, we can now create our RFM table

After created RFM dataframe, we need to label atau put in Range for Recency, Frequency, Monetary. Note that:

  • We will rate Recency customer who have been active more recently better than the less recent customer,because each company wants its customers to be recent

  • We will rate Frequency and Monetary Value higher label because we want Customer to spend more money and visit more often(that is different order than recency)

After getting rfm_score, let’s group customers into Gold, Silver, Bronze

Let’s see the characteristic of each segment

We finds:

  • Characteristics of Bronze:
    The average Recency is above 180 days which mean most of Bronze segment was not doing transaction for 6 months since their last transaction. On the other hand, average Frequency is around 15 showing that most of customers were doing transaction around 15 times before they stopped purchasing six months later. Average of total purchasing of each customer (Monetary) is around GBP 266.
  • Characteristics of Silver:
    The average Recency is above 60 days that indicates most of Silver segment was not doing transaction for 2 months since their last transaction. On the other hand, average Frequency is around 49 times of purchasing. For Monetary value, Silver segment customers spent around 1000 GBP in online store.
  • Characteristics of Gold:
    The average Recency is around 20 days that indicates most of Gold segment was still doing transactions within one month since their last transactions. On the other hand, average Frequency is around 226 times of purchasing. For Monetary value, Gold segment customers spent around 5259 GBP in online store.

4.2 K-Means Clustering

To do K-Means Clustering, we need to scale the data first. Column that we are going to use is Recency, Frequency, Monetary

After done scaling, we need to decide value for ‘k’ to decide how many clustering our data will be

According to wss above, we can use 3 as our k value due to sloping graphic after 3 and we would like compare the characteristic of clusters to generated RFM class, we should use same number of clusters. We can start create our K-means clustering.

After clustering, let’s see the characteristic of each cluster.

As we can see characteristics of each cluster in K-Means clustering, we find:

  • Each cluster in K-Means has very different characteristics with Simple Clustering.
  • Specially the 3rd cluster of K-Means, it has a significantly different characteristic comparing to Gold cluster of Simple Clustering as we can see at below table.

4.3 K-Medoids Clustering

To use K-Medoids Clustering requiring scaled data, we can use the previous scaled data from K-Means clustering. And, we can use ‘3’ as our k-values since we would like to compare the result with other clustering model. Thus, in this session we can move directly to modeling.

Let’s check the characteristic of each cluster

We find:

  • Characteristics of K-Medoids clusters are more similiar to Simple Clustering

4.4 Clustering Analysis

For easiness to see distribution characteristic of each cluster of each data frame, let’s divide with average value of each column to have better understanding.

We find:

  • Simple clustering (plot 1) and K-Mediods clustering (plot 3) are alike.

  • At K-Mean clustering (plot2), there are very significantly different values between cluster 3 and the other cluster in term of Frequency and Monetary.

since we find out 3rd cluster of K-Means has significantly different value than the other clusters, let’s check how many customers are in this cluster.

## [1] 13

As we can see the result, only 13 of 4338 (less than 1%) customers are in this cluster, we can assume that all customers that are in this cluster are outliers.

5 Conclusion

As the end of this report, we can conclude that:

  • Most of customers prefer to do purchasing on middle of weeks, Wednesday & Thursday, at noon around 12.00 -13.00. With this information, we can make a lot promotion on Wednesday & Thursday or we can boost up other day sales by giving special discount.
  • As we find out simple model and K-Medoids model are alike, we find that one of cluster in K-Means model have significantly different value which we can assume that all customers are in that cluster are outliers. After doing clustering, we can make a special package for each class, so we can maintain existing customers or add new customers.
  • As we can see the similiarity characteristics of Simple Clustering and K-Medoids, we can use K-Medoids clustering for our modeling, thus we find (base on k-medoids clustering table):
    -Characteristics of 1st Cluster:
    The average Recency is above 180 days which mean most customers of Cluster 1 were not doing transaction for 6 months since their last transaction. On the other hand, average Frequency is around 25 showing that most of customers were doing transaction around 25 times before they stopped purchasing six months later. Average total purchasing value (Monetary) is around GBP 595.
    -Characteristics of 2nd Cluster:
    The average Recency is above 90 days that indicates most customers of Cluster 2 segment were not doing transaction for 3 months since their last transaction. On the other hand, average Frequency is around 41 times of purchasing. For Monetary value, customers of Cluster 2 spent around 862 GBP in online store.
    -Characteristics of 3rd Cluster:
    The average Recency is around 22 days that indicates most customers of Cluster 3 were still doing transactions within one month since their last transactions. On the other hand, average Frequency is around 142 times of purchasing. For Monetary value, customers spent around 3251 GBP in online store.