Credit analysis of transaction data: K-Means clustering approach.

Segmentation of oranges

Recency, Frequency, and Monetary (RFM) analysis is a conventional customer segmentation technique that has been in use for a long time.

Recency refers to how recently the customer made their latest transaction using our product.

Frequency refers to how often the customer makes transactions using our product.

Monetary Value refers to how much the customer spends on our product.

The RFM method is straightforward. We only have to transform our data (usually transactional data) into a data frame consisting of three variables: Recent Transaction, Transaction Frequency, and Transaction Amount (Monetary Value).

Transactional data records every transaction made by customers. Typically, it consists of the transaction time, the transaction location, the amount spent, the merchant where the transaction took place, and any other detail that can be recorded at the moment the transaction is made.

Let's look at the transactional dataset that will serve as our case study. It contains 2016 credit card transactions from every customer and consists of 24 features recorded for every transaction. Although the dataset has many features, we will use only the few that can be transformed into Recency, Frequency, and Monetary Value.

Link to dataset: https://www.kaggle.com/derykurniawan/credit-card-transaction

Features of transactional data
Feature Names
Account number
Customer id
Credit limit
Available money
Transaction date time
Transaction amount
Merchant name
Acq country
Merchant country code
Pos entry mode
Pos condition code
Merchant category code
Current exp date
Account open date
Date of last address change
Card cvv
Entered cvv
Card last4digits
Transaction type
Is fraud
Current balance
Card present
Expiration date key in match

Returning to our description of the RFM features, we only have to keep customerId, transactionDate, and transactionAmount to create the Recency, Frequency, and Transaction Amount features in a new data frame grouped by customerId.

For the Recency feature, we subtract the maximum value of transactionDate (the latest transaction) from the current date. Since our dataset only contains 2016 transactions, we set 1 January 2017 as the current date.

For the Frequency feature, we count the number of transactions made by every customer using the n() function (from the dplyr package) in R.

For the Transaction Amount feature, we sum transactionAmount for every customer.
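The three aggregations described above can be sketched in R as follows. This is a minimal illustration rather than the original analysis code: the toy transactions are invented, and it assumes transactionDate has already been parsed into a Date column and that the dplyr package is installed.

```r
library(dplyr)

# Toy transactions; column names follow the article, values are invented.
transactions <- data.frame(
  customerId        = c("A", "A", "B", "B", "B", "C"),
  transactionDate   = as.Date(c("2016-01-15", "2016-12-30", "2016-03-01",
                                "2016-06-10", "2016-11-20", "2016-02-05")),
  transactionAmount = c(50, 25, 10, 0, 40, 300)
)

# Reference "current date": 1 January 2017, as chosen in the article.
current_date <- as.Date("2017-01-01")

rfm <- transactions %>%
  group_by(customerId) %>%
  summarise(
    # Days between the reference date and the customer's latest transaction.
    recency   = as.numeric(difftime(current_date, max(transactionDate),
                                    units = "days")),
    # Number of transactions per customer.
    frequency = n(),
    # Total amount spent per customer.
    monetary  = sum(transactionAmount)
  )

print(rfm)
```

Each customer ends up as one row, which is the shape the later clustering step expects.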

Now we have the three main features for RFM segmentation. As in any other data analysis, the first step is to explore the dataset. In this case, we check the distribution of every feature with a histogram, using the hist() function in R.
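A quick way to draw one histogram per feature might look like the sketch below. The rfm data frame here is hypothetical and filled with randomly generated, right-skewed values purely for illustration; the column names are assumptions as well.

```r
set.seed(42)

# Hypothetical RFM data frame with invented, right-skewed values.
rfm <- data.frame(
  recency   = rexp(500, rate = 1 / 10),
  frequency = rgeom(500, prob = 0.1) + 1,
  monetary  = rlnorm(500, meanlog = 6, sdlog = 1)
)

# Draw the three histograms side by side, then restore the plot settings.
old_par <- par(mfrow = c(1, 3))
for (feature in names(rfm)) {
  hist(rfm[[feature]], main = feature, xlab = feature)
}
par(old_par)
```

A long tail to the right in these plots is the visual sign of right skew.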

Our RFM dataset is heavily right-skewed, which is a serious problem for K-Means clustering because the method uses the distance between points to determine which cluster each point fits best. A log transformation can handle this kind of skew. Since the data contains zero values and log(0) is undefined, we use log(n + 1) instead of the ordinary log transformation.

The logarithmic transformation removes much of the skew in our RFM dataset, giving the K-Means method better data from which to find the best clusters.
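As a small sketch of the log(n + 1) transformation, base R's log1p() can be applied to every column. The rfm values below are invented, and the column names are assumptions.

```r
# Hypothetical RFM values, including zeros.
rfm <- data.frame(
  recency   = c(0, 2, 42, 331),
  frequency = c(1, 3, 20, 500),
  monetary  = c(0, 75, 1200, 82000)
)

# log1p(x) computes log(1 + x), so zeros map to 0 instead of -Inf.
rfm_log <- as.data.frame(lapply(rfm, log1p))

print(rfm_log)
```

The transformation compresses large values far more than small ones, which is what pulls in the long right tail.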

K-Means Clustering

K-Means clustering is an unsupervised learning method that groups unlabeled data based on similarity.

In R, K-Means clustering can be done quickly using the kmeans() function. However, we have to choose the number of clusters before creating the K-Means model. There are many ways to find the best number of groups: we can use our business sense and assign the number directly, or we can take a mathematical approach and measure the similarity between points.

In this example, we use the within-cluster sum of squares, which measures the variability of the observations within each cluster. We iteratively calculate the within-cluster sum of squares for every number of clusters from 1 to 10 and choose the number at which the value is low and adding another cluster no longer changes it significantly. This is often called the Elbow Method.
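The loop over cluster counts can be sketched as follows. The data is synthetic (four invented groups standing in for log-transformed RFM features), and nstart = 25 is an assumption added here to make the result less sensitive to the random starting centers.

```r
set.seed(123)

# Hypothetical log-transformed RFM matrix: four synthetic groups of 50 customers.
group_means <- rbind(c(1, 6, 11), c(2, 4.5, 9), c(2.5, 3.5, 8), c(3.7, 2, 6))
rfm_log <- do.call(rbind, lapply(1:4, function(i) {
  cbind(
    recency   = rnorm(50, mean = group_means[i, 1], sd = 0.2),
    frequency = rnorm(50, mean = group_means[i, 2], sd = 0.2),
    monetary  = rnorm(50, mean = group_means[i, 3], sd = 0.2)
  )
}))

# Total within-cluster sum of squares for k = 1, ..., 10.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(rfm_log, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow off the plot is a judgment call, so it is worth checking whether the chosen number of clusters also makes business sense.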

Using the elbow method, we choose four as our number of clusters. With the kmeans() function in R, we only need to pass the number of clusters in the centers parameter and assign the clustering results to our dataset.
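A minimal sketch of this step, again on synthetic data, is shown below. The segment column name, the fixed seed, and nstart = 25 are assumptions made for the illustration.

```r
set.seed(123)  # fixed seed so the random starts are reproducible

# Hypothetical log-transformed RFM data: four synthetic groups of 30 customers.
group_means <- rbind(c(1, 6, 11), c(2, 4.5, 9), c(2.5, 3.5, 8), c(3.7, 2, 6))
rfm_log <- as.data.frame(do.call(rbind, lapply(1:4, function(i) {
  cbind(
    recency   = rnorm(30, mean = group_means[i, 1], sd = 0.1),
    frequency = rnorm(30, mean = group_means[i, 2], sd = 0.1),
    monetary  = rnorm(30, mean = group_means[i, 3], sd = 0.1)
  )
})))

# Fit K-Means with four clusters.
fit <- kmeans(rfm_log, centers = 4, nstart = 25)

# Attach each customer's cluster label back to the data.
rfm_log$segment <- fit$cluster

print(table(rfm_log$segment))  # number of customers per segment
print(fit$centers)             # cluster centers on the log scale
```

The cluster numbers returned by kmeans() are arbitrary labels, so the same group can receive a different number on another run unless the seed is fixed.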

RFM Summary per Segment
Segment   Recency   Frequency   Transaction amount   Members
1           10.27       28.42                2,840     1,811
2           39.90        9.06                  668       686
3            4.90       92.09               11,921     1,772
4            2.53      576.01               82,130       731
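A per-segment summary of this kind could be produced with base R's aggregate(), as in the sketch below. It assumes the table reports the mean of each feature on the original (untransformed) scale, which the article does not state explicitly. The data and the segment column name are invented.

```r
# Hypothetical untransformed RFM values with a cluster label per customer.
rfm <- data.frame(
  recency   = c(2, 3, 40, 45, 10, 12),
  frequency = c(500, 600, 8, 10, 30, 26),
  monetary  = c(80000, 84000, 600, 700, 2800, 2900),
  segment   = c(1, 1, 2, 2, 3, 3)
)

# Mean of each RFM feature per segment (assumed summary statistic).
summary_tbl <- aggregate(cbind(recency, frequency, monetary) ~ segment,
                         data = rfm, FUN = mean)

# Number of customers in each segment, in the same (sorted) segment order.
summary_tbl$members <- as.vector(table(rfm$segment))

print(summary_tbl)
```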

So we have four groups. Let's discuss each one in detail, following the segment numbers in the summary table:

  1. Segment-1 (Bronze): Common customers with low transaction frequency and low spending amount. However, this segment has the largest number of customers.

  2. Segment-2 (Inactive): Inactive or less-active customers whose latest transaction was made more than a month ago. This segment has the lowest number of customers, transaction frequency, and transaction amount.

  3. Segment-3 (Silver): Middle-class customers with the second-highest transaction frequency and spending amount.

  4. Segment-4 (Gold): Most valuable customers, who have the highest spending amount and make the most transactions.

Now we have four groups of customers with detailed RFM behaviour for each group. This information can typically be used to design marketing strategies that are well targeted at customers who share similar behaviour. Recency, Frequency, and Monetary Value segmentation is simple but useful for knowing your customers better and aiming for an efficient, optimal marketing strategy.