Project Description

This project is a customer segmentation project that utilizes a tool called RFM analysis. RFM analysis divides customers into groups based on their different purchasing behaviors. The analysis is based on three data points:
1) Recency.
2) Frequency.
3) Monetary.

The details of each data point are as follows:

  1. Recency.
    Recency measures how recently a customer made a purchase. This can be determined by looking at the date of the customer’s last purchase. Customers who have made a purchase in the recent past are more likely to be active customers and are more likely to make repeat purchases.

  2. Frequency.
    Frequency measures how often a customer makes purchases. This can be determined by looking at the number of purchases a customer has made over a period of time. Customers who make frequent purchases are more likely to be loyal customers and are more likely to spend more money with the company.

  3. Monetary.
    Monetary measures how much money a customer spends on each purchase. This can be determined by looking at the total amount of money a customer has spent with the company over a period of time. Customers who spend more money with the company are more likely to be high-value customers and are more likely to be profitable for the company.

RFM analysis can be used to segment customers into different groups based on their RFM scores. For example, customers who have a high recency score, a high frequency score, and a high monetary score could be classified as “loyal customers.” These customers are likely to be the most valuable customers for the company and should be targeted with marketing campaigns that are designed to retain them. Customers who have a low recency score, a low frequency score, and a low monetary score could be classified as “lapsed customers.” These customers are not as valuable to the company and may be at risk of churning. Marketing campaigns that are designed to reactivate these customers could be targeted to them.

About Dataset

This dataset contains sales data for a retail store from 2022 to 2023. The data includes the following information:

No. Attribute Name Description Type of Data
1 InvoiveNo Number of invoice Text
2 StockCode Product code Text
3 Description Description of product Text
4 Quantity Amount purchased Integer
5 InvoiceDate Date of invoice Text
6 UnitPrice Price per unit of product Real
7 CustomerID ID of customer Text
8 Country Country of retail location Text

The sample data can be used to get a better understanding of the data and you can view here: retail_data

Scope of Project

  1. Querying the Data
    Query the SQLite database using Python. Establish a connection to the database and write an SQL script to retrieve the data necessary for RFM analysis.
  2. Data preparation
    Prepare the data using pandas and NumPy. Clean the data by removing null values and filtering it to specific requirements for RFM analysis.
  3. RFM analysis
    Perform RFM analysis using Python. Calculate three factors:
    • Recency: Time elapsed since the most recent purchase.
    • Frequency: Number of purchases within a defined timeframe.
    • Monetary value: Total amount spent by the customer.
  4. Customer segmentation
    4.1) Using table tool and scoring system.
    4.2) Using scatter plot tool.
    Segment customers using Python. Group them into 10 segments:
    • Champion: High-value recent buyers.
    • Loyal: High-value repeat buyers.
    • Cannot loss them: High-value lapsed customers.
    • Potential Loyalist: Medium-value recent buyers.
    • Need Attention: Medium-value infrequent buyers.
    • At risk: Medium-value lapsed customers.
    • New customer: Low-value recent buyers.
    • Promising: Low-value repeat buyers.
    • About to Sleep: Low-value infrequent buyers.
    • Hibernating: Low-value lapsed customers.

  1. Exporting the Results
    Export the RFM clusters as a CSV file using pandas. This allows for creating reports using business intelligence tools.

Let’s start!

Environment Set up

1)Enabled Python Engine to use package of python follow by path.

library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.1
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3")

2)Import libraries that are related in this analysis as follows:
- NumPy: Blazing-fast arrays and math functions for numerical computing.
- Pandas: Flexible DataFrames for wrangling, analyzing, and visualizing tabular data.
- datetime: Accurately representing and manipulating dates, times, and durations.
- Matplotlib: Diverse plots and visualizations to explore and understand your data.
- Seaborn: Statistical-focused visualizations built on Matplotlib for deeper insights.
- sqlite3: Serve interface for interacting with SQLite databases.

import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

First step : Querying the Data

The first step involves querying the desired data for RFM analysis from the SQLite database. We need to filter the data to include only UK sales with a purchase quantity greater than zero. To perform RFM analysis in the following step, we will export this data to a CSV file.

try:
    ## Use the connect() method of the connector class with the database name. To establish a connection to SQLite.
    conn = sqlite3.connect('/Users/j.nrup/Documents/Data Project/RFM analysis/retail_data.db')

    ## Use the cursor() method of a connection class to create a cursor object to execute or interact with the database SQLite command/queries from Python. 
    cursor = conn.cursor()
    print("Successfully Connected to SQLite")
    
    ## Load the data into a DataFrame
    ## Focuesd the order in UK and filter cancel order out
    sales_uk = pd.read_sql_query("SELECT * FROM retail WHERE Country = 'United Kingdom' AND Quantity > 0", conn)
    print("Retail in UK", sales_uk)

    ## Write the sales_UK dataframe to CSV file
    sales_uk.to_csv("Retail_in_UK.csv")

    ## Use cursor.clsoe() method to close the cursor
    cursor.close()

except sqlite3.Error as error:
    print("Error while connecting to sqlite", error)
finally:
    ## Use connection.clsoe() method to close the SQLite connections.
    if conn:
        conn.close()
        print("The SQLite connection is closed")
## Successfully Connected to SQLite
## Retail in UK        InvoiceNo StockCode  ... CustomerID         Country
## 0         536365    85123A  ...      17850  United Kingdom
## 1         536365     71053  ...      17850  United Kingdom
## 2         536365    84406B  ...      17850  United Kingdom
## 3         536365    84029G  ...      17850  United Kingdom
## 4         536365    84029E  ...      17850  United Kingdom
## ...          ...       ...  ...        ...             ...
## 486281    581585     22466  ...      15804  United Kingdom
## 486282    581586     22061  ...      13113  United Kingdom
## 486283    581586     23275  ...      13113  United Kingdom
## 486284    581586     21217  ...      13113  United Kingdom
## 486285    581586     20685  ...      13113  United Kingdom
## 
## [486286 rows x 8 columns]
## The SQLite connection is closed

Second step : Data preparation

  1. Get the data from the CSV file exported in the previous step.
# Define the data types of columns while reading a CSV file.
dtype_mapping = {
    'InvoiceNo' : 'string',
    'StockNo' : 'string',
    'CustomerID' : 'string'
}
# Create Data frame called 'sales_in_uk' by CSV file reading.
sales_in_uk = pd.read_csv('Retail_in_UK.csv', encoding="ISO-8859-1", dtype=dtype_mapping)
# Set the first column as index
sales_in_uk.set_index(sales_in_uk.columns[0], inplace=True)
# Print Top 10 of sales_in_uk
sales_in_uk.head(10)
##            InvoiceNo StockCode  ... CustomerID         Country
## Unnamed: 0                      ...                           
## 0             536365    85123A  ...      17850  United Kingdom
## 1             536365     71053  ...      17850  United Kingdom
## 2             536365    84406B  ...      17850  United Kingdom
## 3             536365    84029G  ...      17850  United Kingdom
## 4             536365    84029E  ...      17850  United Kingdom
## 5             536365     22752  ...      17850  United Kingdom
## 6             536365     21730  ...      17850  United Kingdom
## 7             536366     22633  ...      17850  United Kingdom
## 8             536366     22632  ...      17850  United Kingdom
## 9             536367     84879  ...      13047  United Kingdom
## 
## [10 rows x 8 columns]
  1. Check dimensions of the dataset.
# Check number of row and column
sales_in_uk.shape
## (486286, 8)
  1. Remove rows where the Customer ID is null.
# Remove rows that CustomerID are null value.
sales_in_uk.dropna(subset=['CustomerID'], how='all', inplace=True)
# Check number of row and column again
sales_in_uk.shape
## (354345, 8)
  1. Verify data types of the “Invoice Date” column.
# Check data type of InvoiceDate
sales_in_uk['InvoiceDate'].dtype
## dtype('O')

From the result, ‘dtype(’O’) indicates that the data type of a column is ‘object’.

  1. Convert the “Invoice Date” column to a datetime data type and re-verify data type.
# Convert the InvoiceDate that is an objects to the datetime format and store them in a new column
sales_in_uk['InvoiceDate'] = pd.to_datetime(sales_in_uk['InvoiceDate'])
# Check data type of InvoiceDate again
sales_in_uk['InvoiceDate'].dtype
## dtype('<M8[ns]')

From the result, dtype(‘<M8[ns]’) indicates that the data type of a column is datetime64[ns] which is correctly.

  1. Filter data, Select only sales data from the year 2023.
# Focused on YEAR 2023
sales_2023 = sales_in_uk[sales_in_uk['InvoiceDate']> "2022-12-31"]
sales_2023.shape
## (330402, 8)
  1. Summarize the filtered data.
# Summary data that is prepared completely.
print("Summary of Data preparation")
## Summary of Data preparation
print("Number of Transaction : ",sales_2023['InvoiceNo'].nunique())
## Number of Transaction :  15358
print("Number of Product that is bought : ",sales_2023['StockCode'].nunique())
## Number of Product that is bought :  3575
print("Number of Customer : ",sales_2023['CustomerID'].nunique())
## Number of Customer :  3814
print("Total of Sales Quantity : ",sales_2023['Quantity'].sum())
## Total of Sales Quantity :  4001701
print("Total of Revenue in 2023 : ",round((sales_2023['Quantity']*sales_2023['UnitPrice']).sum(),0))
## Total of Revenue in 2023 :  6809730.0
print("Percentage of CustomersID that are null value : ",round(sales_2023['CustomerID'].isnull().sum()*100 / len(sales_2023),2)," %")
## Percentage of CustomersID that are null value :  0.0  %

Third step : RFM Analysis

  1. Recency Calculation.
    Reference Date: We need a reference date to calculate how many days ago a customer’s last purchase occurred.
# last date available in our dataset
last_date = sales_2023['InvoiceDate'].max()
print("Last date purchasing : ", last_date)
## Last date purchasing :  2023-12-09 12:49:00

We will use the last date : 2023-12-09 is reference.

now = dt.date(2023,12,9)
print(now)
## 2023-12-09

Add a new column named “date” containing the invoice date.

sales_2023['date'] = pd.DatetimeIndex(sales_2023['InvoiceDate']).date
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sales_2023.head(10)
##            InvoiceNo StockCode  ...         Country        date
## Unnamed: 0                      ...                            
## 39425         539993     22386  ...  United Kingdom  2023-01-04
## 39426         539993     21499  ...  United Kingdom  2023-01-04
## 39427         539993     21498  ...  United Kingdom  2023-01-04
## 39428         539993     22379  ...  United Kingdom  2023-01-04
## 39429         539993     20718  ...  United Kingdom  2023-01-04
## 39430         539993    85099B  ...  United Kingdom  2023-01-04
## 39431         539993     20682  ...  United Kingdom  2023-01-04
## 39432         539993     22961  ...  United Kingdom  2023-01-04
## 39433         539993     22667  ...  United Kingdom  2023-01-04
## 39434         539993     22898  ...  United Kingdom  2023-01-04
## 
## [10 rows x 9 columns]

Group the data by CustomerID and find the latest date of purchase for each customer.

r_df = sales_2023.groupby(by='CustomerID', as_index=False)['date'].max()
r_df.columns = ['CustomerID','LastPurchaseDate']
r_df.head(10)
##   CustomerID LastPurchaseDate
## 0      12346       2023-01-18
## 1      12747       2023-12-07
## 2      12748       2023-12-09
## 3      12749       2023-12-06
## 4      12820       2023-12-06
## 5      12821       2023-05-09
## 6      12822       2023-09-30
## 7      12823       2023-09-26
## 8      12824       2023-10-11
## 9      12826       2023-12-07

Calculate the recency score for each customer by finding the difference between the reference date and their latest purchase date in days.

r_df['Recency'] = r_df['LastPurchaseDate'].apply(lambda x : (now - x).days)
r_df.head(10)
##   CustomerID LastPurchaseDate  Recency
## 0      12346       2023-01-18      325
## 1      12747       2023-12-07        2
## 2      12748       2023-12-09        0
## 3      12749       2023-12-06        3
## 4      12820       2023-12-06        3
## 5      12821       2023-05-09      214
## 6      12822       2023-09-30       70
## 7      12823       2023-09-26       74
## 8      12824       2023-10-11       59
## 9      12826       2023-12-07        2

Remove the unnecessary “LastPurchaseDate” column.

r_df.drop('LastPurchaseDate',axis=1, inplace=True)
r_df.head(10)
##   CustomerID  Recency
## 0      12346      325
## 1      12747        2
## 2      12748        0
## 3      12749        3
## 4      12820        3
## 5      12821      214
## 6      12822       70
## 7      12823       74
## 8      12824       59
## 9      12826        2
  1. Frequency Calculation.
    Remove duplicates value.
sales_2023_copy = sales_2023
print(sales_2023_copy.shape)
## (330402, 9)
sales_2023_copy.drop_duplicates(subset=['InvoiceDate','CustomerID'],keep='first',inplace=True)
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
print(sales_2023_copy.shape)
## (15279, 9)

Count the number of invoices associated with each CustomerID to determine their frequency of purchase.

f_df = sales_2023_copy.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
f_df.columns = ['CustomerID','Frequency']
f_df.head(10)
##   CustomerID  Frequency
## 0      12346          1
## 1      12747          9
## 2      12748        176
## 3      12749          5
## 4      12820          4
## 5      12821          1
## 6      12822          2
## 7      12823          5
## 8      12824          1
## 9      12826          6
  1. Monetary calculation.
    Calculate the total cost for each invoice and add it to a new column named “Total Cost”.
# Create column total cost
sales_2023['TotalCost'] = sales_2023['Quantity'] * sales_2023['UnitPrice']
## <string>:2: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Calculate the monetary value for each customer by summing their “Total Cost” across all invoices.

m_df = sales_2023.groupby(by='CustomerID', as_index=False).agg({'TotalCost': 'sum'})
m_df.columns = ['CustomerID', 'Monetary']
m_df.head(10)
##   CustomerID  Monetary
## 0      12346  77183.60
## 1      12747    563.69
## 2      12748   3385.53
## 3      12749     98.35
## 4      12820     58.20
## 5      12821     19.92
## 6      12822     30.80
## 7      12823   1759.50
## 8      12824      5.04
## 9      12826    105.10
  1. Summarize RFM:
    Merge the recency, frequency, and monetary value dataframes into a single dataframe named “rfm_df”.
# Recency & Frequency 
rf_df = r_df.merge(f_df,on='CustomerID')
# Recency & Frequency & Monetary
rfm_df = rf_df.merge(m_df,on='CustomerID')
rfm_df.head(10)
##   CustomerID  Recency  Frequency  Monetary
## 0      12346      325          1  77183.60
## 1      12747        2          9    563.69
## 2      12748        0        176   3385.53
## 3      12749        3          5     98.35
## 4      12820        3          4     58.20
## 5      12821      214          1     19.92
## 6      12822       70          2     30.80
## 7      12823       74          5   1759.50
## 8      12824       59          1      5.04
## 9      12826        2          6    105.10

Set the “CustomerID” as the index of the “rfm” dataframe.

rfm_df.set_index('CustomerID',inplace=True)
rfm_df.head(10)
##             Recency  Frequency  Monetary
## CustomerID                              
## 12346           325          1  77183.60
## 12747             2          9    563.69
## 12748             0        176   3385.53
## 12749             3          5     98.35
## 12820             3          4     58.20
## 12821           214          1     19.92
## 12822            70          2     30.80
## 12823            74          5   1759.50
## 12824            59          1      5.04
## 12826             2          6    105.10

Verify RFM Table: Ensure the resulting “rfm” table accurately represents the recency, frequency, and monetary values for each customer.
- TRUE : Correct - FALSE : Incorrect

Cus_Insp_df = sales_2023[sales_2023['CustomerID']=='12820']
Cus_Insp_df
##            InvoiceNo StockCode  ...        date  TotalCost
## Unnamed: 0                      ...                       
## 54523         541283     21977  ...  2023-01-17       13.2
## 322249        568236     23328  ...  2023-09-26       15.0
## 374615        572873     23436  ...  2023-10-26       15.0
## 477157        580973     21098  ...  2023-12-06       15.0
## 
## [4 rows x 10 columns]
print('Last purchase of 12820',Cus_Insp_df['date'].max())
## Last purchase of 12820 2023-12-06
print('Recency of 12820: ', rfm_df.loc['12820','Recency'])
## Recency of 12820:  3
(now - dt.date(2023,12,6)).days == 3
## True

Third step : Customer segmentation

  1. Using table tool and scoring system : Find specific customer.
    The simplest way to create customer segments from the RFM model is to use quartiles. We assign scores from 1 to 4 to recency, frequency, and monetary value, where 4 represents the best/highest value and 1 represents the lowest/worst value. A final RFM score is calculated by combining the individual scores.
    Note: While quintiles (scores from 1 to 5) offer more granularity if your business needs it, they also result in 555 possible combinations, making segment creation more challenging. Therefore, we recommend using quartiles here.
## RFM Quartiles
rfm_qua = rfm_df.quantile(q=[0.25,0.5,0.75])
rfm_qua
##       Recency  Frequency  Monetary
## 0.25     17.0        1.0     17.40
## 0.50     47.0        2.0     44.26
## 0.75    128.0        4.0    118.80
# Transform to Dict data type
rfm_qua.to_dict()
## {'Recency': {0.25: 17.0, 0.5: 47.0, 0.75: 128.0}, 'Frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 4.0}, 'Monetary': {0.25: 17.4, 0.5: 44.26, 0.75: 118.80000000000001}}

Creating RFM Segmentation:
This segmentation will leverage two key classes:
- High recency: This is considered negative because it indicates customers haven’t purchased recently.
- High frequency and monetary value: These are positive factors signifying active and valuable customers.

# First segmentation : Recency
# Arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)
def RScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

# Second segmentation : Frequency & Monetary
# Arguments (x = value, p = recency, monetary_value, frequency, k = quartiles dict)
def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

Create rfm segmentation table.

#create rfm segmentation table
rfm_segmentation = rfm_df
rfm_segmentation['R'] = rfm_segmentation['Recency'].apply(RScore, args=('Recency',rfm_qua))
rfm_segmentation['F'] = rfm_segmentation['Frequency'].apply(FMScore, args=('Frequency',rfm_qua))
rfm_segmentation['M'] = rfm_segmentation['Monetary'].apply(FMScore, args=('Monetary',rfm_qua))
rfm_segmentation.head(10)
##             Recency  Frequency  Monetary  R  F  M
## CustomerID                                       
## 12346           325          1  77183.60  1  1  4
## 12747             2          9    563.69  4  4  4
## 12748             0        176   3385.53  4  4  4
## 12749             3          5     98.35  4  4  3
## 12820             3          4     58.20  4  3  3
## 12821           214          1     19.92  1  1  2
## 12822            70          2     30.80  2  2  2
## 12823            74          5   1759.50  2  4  4
## 12824            59          1      5.04  2  1  1
## 12826             2          6    105.10  4  4  3

With individual customer scores in hand, we’re ready to represent our customer segmentation.
Our first step is to combine the R, F, and M scores into a single representation.

rfm_segmentation['RFM_Score'] = rfm_segmentation.R.map(str) \
                            + rfm_segmentation.F.map(str) \
                            + rfm_segmentation.M.map(str)
rfm_segmentation.head(10)
##             Recency  Frequency  Monetary  R  F  M RFM_Score
## CustomerID                                                 
## 12346           325          1  77183.60  1  1  4       114
## 12747             2          9    563.69  4  4  4       444
## 12748             0        176   3385.53  4  4  4       444
## 12749             3          5     98.35  4  4  3       443
## 12820             3          4     58.20  4  3  3       433
## 12821           214          1     19.92  1  1  2       112
## 12822            70          2     30.80  2  2  2       222
## 12823            74          5   1759.50  2  4  4       244
## 12824            59          1      5.04  2  1  1       211
## 12826             2          6    105.10  4  4  3       443

Our RFM analysis assigns the highest score of 4 to customers with:
- Highest Recency: Most recent purchase.
- Highest Frequency: Largest number of purchases.
- Highest Monetary: Highest total spending.
Let’s identify these “Champion” customers, representing our most valuable segment.

rfm_segmentation[rfm_segmentation['RFM_Score']=='444'].sort_values('Monetary', ascending=False).head(10)
##             Recency  Frequency  Monetary  R  F  M RFM_Score
## CustomerID                                                 
## 18102             0         52  36181.65  4  4  4       444
## 17949             1         44  28685.69  4  4  4       444
## 17450             8         44  25953.51  4  4  4       444
## 16333             7         21  14113.68  4  4  4       444
## 16013             3         39  13981.48  4  4  4       444
## 15769             7         24  11390.84  4  4  4       444
## 12901             8         28   9230.45  4  4  4       444
## 13798             1         55   7617.26  4  4  4       444
## 17857             4         21   7384.68  4  4  4       444
## 13694             3         48   7265.06  4  4  4       444

This section provides suggested customer segments for further analysis. Note that the linked source uses an opposite scoring system, where 1 is the highest/best score and 4 is the lowest.
Before proceeding, let’s determine the number of customers belonging to each segment based on our scoring system.

print("Best Customers: ",len(rfm_segmentation[rfm_segmentation['RFM_Score']=='444']))
## Best Customers:  374
print('Loyal Customers: ',len(rfm_segmentation[rfm_segmentation['F']==4]))
## Loyal Customers:  928
print("Big Spenders: ",len(rfm_segmentation[rfm_segmentation['M']==4]))
## Big Spenders:  953
print('Almost Lost: ', len(rfm_segmentation[rfm_segmentation['RFM_Score']=='244']))
## Almost Lost:  85
print('Lost Customers: ',len(rfm_segmentation[rfm_segmentation['RFM_Score']=='144']))
## Lost Customers:  10
print('Lost Cheap Customers: ',len(rfm_segmentation[rfm_segmentation['RFM_Score']=='111']))
## Lost Cheap Customers:  352
  1. Using scatter plot tool.
    Segment customers using Python. Group them into ten segments:
  • Champion : 3.4 < r <= 4, 10 < FxM <= 16.
  • Loyal : 2.2 < r <= 3.4, 10 < FxM <= 16.
  • Cannot Lose Them : 1 <= r =< 2.2, 10 < FxM <= 16.
  • Potential Loyalist : 2.8 < r <= 4, 6 < FxM <= 10.
  • Need Attention : 2.2 < r <= 2.8, 6 < FxM <= 10.
  • At Risk : 1 <= r <= 2.2, 6 < FxM <= 10.
  • New Customer : 3.4 < r <= 4, 1 <= FxM <= 6.
  • Promising : 2.8 < r <= 3.4, 1 <= FxM <= 6.
  • About to Sleep : 2.2 < r <= 2.8, 1 <= FxM <= 6.
  • Hilbernating : 1 <= r <= 2.2, 1 <= FxM <= 6.

2.1) Define cluster code.

rfm_segment = rfm_segmentation.copy()

# Create function for define cluster code
def assign_clusterCode(row):
    if row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 3.4 and row['R'] <= 4:
        return 1
    elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 2.2 and row['R'] <= 3.4:
        return 2
    elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] >= 1 and row['R'] <= 2.2:
        return 3
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.8 and row['R'] <= 4: 
        return 4
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.2 and row['R'] <= 2.8:
        return 5
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] >= 1 and row['R'] <= 2.2:
        return 6
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 3.4 and row['R'] <= 4:
        return 7
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.8 and row['R'] <= 3.4:
        return 8
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.2 and row['R'] <= 2.8:
        return 9
    else:
        return 10

# Apply function to each row to create 'Cluster' column
rfm_segment['ClusterCode'] = rfm_segment.apply(assign_clusterCode, axis=1)
rfm_segment.head(10)
##             Recency  Frequency  Monetary  R  F  M RFM_Score  ClusterCode
## CustomerID                                                              
## 12346           325          1  77183.60  1  1  4       114           10
## 12747             2          9    563.69  4  4  4       444            1
## 12748             0        176   3385.53  4  4  4       444            1
## 12749             3          5     98.35  4  4  3       443            1
## 12820             3          4     58.20  4  3  3       433            4
## 12821           214          1     19.92  1  1  2       112           10
## 12822            70          2     30.80  2  2  2       222           10
## 12823            74          5   1759.50  2  4  4       244            3
## 12824            59          1      5.04  2  1  1       211           10
## 12826             2          6    105.10  4  4  3       443            1

2.2) Define cluster name.

# Create function for define cluster name 10
def assign_clusterName(row):
    if row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 3.4 and row['R'] <= 4:
        return 'Champion'
    elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 2.2 and row['R'] <= 3.4:
        return 'Loyal'
    elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] >= 1 and row['R'] <= 2.2:
        return 'Cannot Lose them'
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.8 and row['R'] <= 4: 
        return 'Potential Loyalist'
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.2 and row['R'] <= 2.8:
        return 'Need Attention'
    elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] >= 1 and row['R'] <= 2.2:
        return 'At risk'
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 3.4 and row['R'] <= 4:
        return 'New Customer'
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.8 and row['R'] <= 3.4:
        return 'Promising'
    elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.2 and row['R'] <= 2.8:
        return 'About to Sleep'
    else:
        return 'Hibernating'

# Apply function to each row to create 'Cluster' column
rfm_segment['ClusterName'] = rfm_segment.apply(assign_clusterName, axis=1)
rfm_segment.head(10)
##             Recency  Frequency  ...  ClusterCode         ClusterName
## CustomerID                      ...                                 
## 12346           325          1  ...           10         Hibernating
## 12747             2          9  ...            1            Champion
## 12748             0        176  ...            1            Champion
## 12749             3          5  ...            1            Champion
## 12820             3          4  ...            4  Potential Loyalist
## 12821           214          1  ...           10         Hibernating
## 12822            70          2  ...           10         Hibernating
## 12823            74          5  ...            3    Cannot Lose them
## 12824            59          1  ...           10         Hibernating
## 12826             2          6  ...            1            Champion
## 
## [10 rows x 9 columns]

2.3) Create FxM attribute.

# Create column FxM
rfm_segment['FxM'] = rfm_segment['F'] * rfm_segment['M']
rfm_segment.head(10)
##             Recency  Frequency  Monetary  ...  ClusterCode         ClusterName  FxM
## CustomerID                                ...                                      
## 12346           325          1  77183.60  ...           10         Hibernating    4
## 12747             2          9    563.69  ...            1            Champion   16
## 12748             0        176   3385.53  ...            1            Champion   16
## 12749             3          5     98.35  ...            1            Champion   12
## 12820             3          4     58.20  ...            4  Potential Loyalist    9
## 12821           214          1     19.92  ...           10         Hibernating    2
## 12822            70          2     30.80  ...           10         Hibernating    4
## 12823            74          5   1759.50  ...            3    Cannot Lose them   16
## 12824            59          1      5.04  ...           10         Hibernating    1
## 12826             2          6    105.10  ...            1            Champion   12
## 
## [10 rows x 10 columns]

2.4 ) Create customer segmentation by scatter plot.

# Define customer segments
segments = {
    1: "Champion",
    2: "Loyal",
    3: "Cannot loss them",
    4: "Potential Loyalist",
    5: "Need Attention",
    6: "At risk",
    7: "New Customer",
    8: "Promising",
    9: "About to Sleep",
    10: "Hibernating"
}

# Define colors for each segment
colors = {
    "Champion" : "gold",
    "Loyal" : "silver",
    "Cannot loss them" : "orange",
    "Potential Loyalist" : "green",
    "Need Attention" : "blue",
    "At risk" : "red",
    "New Customer" : "pink",
    "Promising" : "brown",
    "About to Sleep" : "gray",
    "Hibernating" : "black"
}

# Plot scatter plot with legend
plt.figure(figsize=(10, 6))
for segment, group in rfm_segment.groupby('ClusterCode'):
    label = segments[segment]
    color = colors[label]
    plt.scatter(group['R'], group['FxM'], label=label, color=color)
    # Annotate each point with CustomerID
    for customer_id, r, f in zip(group.index, group['R'], group['FxM']):
        plt.annotate(customer_id, (r, f), fontsize=8)


plt.xlabel('Recency')
plt.ylabel('Frequency X Monetary')
plt.title('Customer Segmentation')
plt.legend(loc='right', bbox_to_anchor=(1.4, 0.5), ncol=1)
plt.show()

Last step : Exporting the Results

## CSV file exporting to create report by BI tool.
rfm_segment.to_csv('Cluster_data.csv')

Create a sales analysis report and perform customer segmentation using RFM Analysis. Use the BI tool Tableau Public to display the results as follows:

Sales RFM Analysis Report
Sales RFM Analysis Report

You can visit the report here: Sales RFM Analysis Report

Conclusion

RFM analysis is a powerful tool that can be used to segment customers and target marketing campaigns. By understanding the different purchasing behaviors of customers, businesses can create marketing campaigns that are more likely to be successful.