Segmentation of oranges
Recency, Frequency, and Monetary (RFM) analysis is one of the conventional techniques for customer segmentation and has been in use for a long time.
Recency refers to how recently a customer made their latest transaction using our product.
Frequency refers to how often a customer makes transactions using our product.
Monetary Value refers to how much a customer spends on our product.
The RFM method is straightforward: we only have to transform our data (usually transactional data) into a data frame with three variables: Recent Transaction, Transaction Frequency, and Transaction Amount (Monetary Value).
Transactional data records every transaction made by customers. Typically, it includes the transaction time, the transaction location, the amount spent, the merchant where the transaction took place, and any other detail that can be recorded at the moment a transaction is made.
Let's look at the transactional dataset that will serve as our case study. It contains 2016 credit card transactions for every customer, with 24 features recorded for each transaction. Although the dataset has many features, we will not use all of them; we only need the few that can be transformed into Recency, Frequency, and Monetary Value.
Link to dataset: https://www.kaggle.com/derykurniawan/credit-card-transaction
| Feature Names |
|---|
| Account number |
| Customer id |
| Credit limit |
| Available money |
| Transaction date time |
| Transaction amount |
| Merchant name |
| Acq country |
| Merchant country code |
| Pos entry mode |
| Pos condition code |
| Merchant category code |
| Current exp date |
| Account open date |
| Date of last address change |
| Card cvv |
| Entered cvv |
| Card last4digits |
| Transaction type |
| Is fraud |
| Current balance |
| Card present |
| Expiration date key in match |
Returning to our description of the RFM features, we only have to keep customerId, transactionDate, and transactionAmount to create the Recency, Frequency, and Transaction Amount features in a new data frame grouped by customerId.
For the Recency feature, we subtract the maximum value of transactionDate (the latest transaction) from the current date. Since our dataset only contains 2016 transactions, we set 1 January 2017 as the current date.
For the Frequency feature, we count how many transactions each customer made, using the n() function (from the dplyr package) in R.
For the Transaction Amount feature, we sum transactionAmount for every customer.
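The three aggregations above can be sketched as follows. This is a minimal illustration on a made-up six-row table, not the article's actual code: the values are invented, the names `daysSince` and `rfm` are assumptions, and base R `tapply()` is used in place of dplyr's `n()` so the snippet needs no extra packages.

```r
# Toy transactions (invented values); column names follow the article.
transactions <- data.frame(
  customerId = c("A", "A", "A", "B", "B", "C"),
  transactionDate = as.Date(c("2016-03-01", "2016-11-20", "2016-12-30",
                              "2016-06-15", "2016-10-01", "2016-12-31")),
  transactionAmount = c(120.50, 35.00, 60.25, 400.00, 0.00, 15.75)
)

# Reference "current" date, as chosen in the article.
current_date <- as.Date("2017-01-01")

# Days between each transaction and the reference date (hypothetical helper column).
transactions$daysSince <- as.numeric(current_date - transactions$transactionDate)

# Smallest gap = latest transaction; length = transaction count; sum = total spend.
recency   <- tapply(transactions$daysSince, transactions$customerId, min)
frequency <- tapply(transactions$transactionAmount, transactions$customerId, length)
monetary  <- tapply(transactions$transactionAmount, transactions$customerId, sum)

# One row per customer; all three vectors share the same customer ordering.
rfm <- data.frame(
  customerId = names(recency),
  recency    = as.numeric(recency),
  frequency  = as.integer(frequency),
  monetary   = as.numeric(monetary)
)
print(rfm)
```

Taking the minimum of the per-row day gaps is equivalent to subtracting the latest transactionDate from the current date, and it avoids handling Date objects inside the aggregation.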
Now we have the three main features for RFM segmentation. As in any other data analysis, the first step is to explore the dataset; here we check the distribution of every feature with histograms, using the hist() function in R.
Our RFM dataset is heavily right-skewed, which is a serious problem for the K-Means clustering method: it relies on the distance between points to decide which cluster each point fits best, so a few extreme values can dominate the result. A log transformation can handle this kind of skew, and since the data contain zeros, we use log(n + 1) instead of the ordinary log transformation.
The logarithmic transformation removes much of the skew in our RFM dataset and gives the K-Means method better-behaved data for finding clusters.
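As a rough illustration of the transformation, the sketch below applies `log1p()`, base R's function for log(x + 1), to a small invented table. The values and the max/median ratio used as a crude skew indicator are assumptions made for this example only.

```r
# Toy right-skewed RFM values (invented): one heavy user dominates each column.
rfm <- data.frame(
  recency   = c(0, 1, 2, 5, 40),
  frequency = c(3, 8, 12, 30, 600),
  monetary  = c(50, 300, 900, 2500, 80000)
)

# log1p(x) computes log(x + 1), so a zero stays zero instead of becoming -Inf.
rfm_log <- as.data.frame(lapply(rfm, log1p))

# Crude skew check: how far the largest value sits from the median.
ratio_raw <- sapply(rfm, function(x) max(x) / median(x))
ratio_log <- sapply(rfm_log, function(x) max(x) / median(x))

print(round(rbind(raw = ratio_raw, log = ratio_log), 2))
```

Calling `hist()` on a column of `rfm` and on the same column of `rfm_log` would show the same effect visually.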
K-Means clustering is an unsupervised learning method that divides unlabeled data into groups based on similarity.
In R, K-Means clustering can be done quickly with the kmeans() function. However, we have to choose the number of clusters before creating the K-Means model. There are many ways to choose it: we can rely on business sense and set the number directly, or we can use a quantitative measure of how similar the points within each cluster are.
In this example, we use the within-cluster sum of squares, which measures the variability of the observations within each cluster. We compute it for every number of clusters from 1 to 10 and choose the number after which adding another cluster no longer reduces the value significantly. This is often called the Elbow Method.
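The loop described above might look like the sketch below. It runs on synthetic data (three well-separated blobs standing in for the log-transformed RFM table), so the elbow it finds says nothing about the real dataset; `make_blob`, `rfm_log`, `wss`, and the `nstart = 10` setting are all assumptions for illustration.

```r
set.seed(42)  # kmeans() starts from random centers; fix the seed for repeatability

# Hypothetical helper: n points scattered around a 3-D center.
make_blob <- function(n, center) {
  matrix(rnorm(n * 3, sd = 0.3), ncol = 3) + matrix(center, n, 3, byrow = TRUE)
}

# Synthetic stand-in for the log-transformed RFM table.
rfm_log <- rbind(make_blob(50, c(1, 2, 5)),
                 make_blob(50, c(3, 4, 8)),
                 make_blob(50, c(0.5, 6, 11)))
colnames(rfm_log) <- c("recency", "frequency", "monetary")

# Total within-cluster sum of squares for k = 1..10.
wss <- sapply(1:10, function(k) {
  kmeans(rfm_log, centers = k, nstart = 10)$tot.withinss
})

# Reduction gained by each extra cluster; the elbow is where it flattens out.
elbow <- data.frame(k = 1:10, wss = round(wss, 1),
                    drop = c(NA, round(-diff(wss), 1)))
print(elbow)
```

Plotting `wss` against `1:10` gives the usual elbow chart; on this synthetic data the curve should flatten after three clusters, matching the three blobs that were generated.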
Using the elbow method, we choose four as our number of clusters. With the kmeans() function in R, we only need to pass the cluster number to the centers parameter and assign the clustering results to our dataset.
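A minimal sketch of this step is shown below, assuming the model is fitted on the log-transformed columns and the segments are then profiled on the original scale. The data are randomly generated and do not reproduce the article's results; `segment`, `profile`, and `nstart = 25` are names and settings chosen for this example.

```r
set.seed(2016)

# Synthetic RFM table with four loose groups (invented parameters).
n <- 30
rfm <- data.frame(
  recency   = rpois(4 * n, rep(c(10, 40, 5, 2), each = n)),
  frequency = rpois(4 * n, rep(c(30, 10, 100, 500), each = n))
)
rfm$monetary <- rfm$frequency * runif(4 * n, min = 80, max = 140)

# Log-transform before clustering, as discussed earlier in the article.
rfm_log <- as.data.frame(lapply(rfm, log1p))

# Fit K-Means with four clusters and attach the labels to the original table.
fit <- kmeans(rfm_log, centers = 4, nstart = 25)
rfm$segment <- fit$cluster

# Per-segment averages on the original scale, plus the number of members.
profile <- aggregate(cbind(recency, frequency, monetary) ~ segment,
                     data = rfm, FUN = mean)
profile$members <- as.vector(table(rfm$segment))
print(profile)
```

The cluster numbers returned by `kmeans()` are arbitrary labels, so a segment's number can change between runs even when the grouping itself is the same.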
| Segment | Recency | Frequency | Transaction amount | Members |
|---|---|---|---|---|
| 1 | 10.27 | 28.42 | 2,840 | 1,811 |
| 2 | 39.90 | 9.06 | 668 | 686 |
| 3 | 4.90 | 92.09 | 11,921 | 1,772 |
| 4 | 2.53 | 576.01 | 82,130 | 731 |
So, we have four groups. Let’s discuss each of them in detail:
Segment-1 (Bronze): Ordinary customers with relatively low transaction frequency and spending amount. This segment has the largest number of customers.
Segment-2 (Inactive): Inactive or less active customers whose latest transaction was, on average, more than a month ago. This segment has the lowest number of customers, transaction frequency, and transaction amount.
Segment-3 (Silver): Middle-tier customers with the second-highest transaction frequency and spending amount.
Segment-4 (Gold): The most valuable customers, who spend the most and make transactions most often.
Now we have four groups of customers, each with a detailed RFM profile. This information is typically used to design marketing strategies targeted at customers who share similar behaviour. Recency, Frequency, and Monetary Value segmentation is simple but useful for knowing your customers better and for planning an efficient marketing strategy.