This project is a customer segmentation project that utilizes a tool
called RFM analysis. RFM analysis divides customers
into groups based on their different purchasing behaviors. The analysis
is based on three data points:
1) Recency.
2) Frequency.
3) Monetary.
The details of each data point are as follows:
Recency.
Recency measures how recently a customer made a purchase. This can be
determined by looking at the date of the customer’s last purchase.
Customers who have made a purchase in the recent past are more likely to
be active customers and are more likely to make repeat
purchases.
Frequency.
Frequency measures how often a customer makes purchases. This can be
determined by looking at the number of purchases a customer has made
over a period of time. Customers who make frequent purchases are more
likely to be loyal customers and are more likely to spend more money
with the company.
Monetary.
Monetary measures how much money a customer spends on each purchase.
This can be determined by looking at the total amount of money a
customer has spent with the company over a period of time. Customers who
spend more money with the company are more likely to be high-value
customers and are more likely to be profitable for the company.
RFM analysis can be used to segment customers into different groups based on their RFM scores. For example, customers who have a high recency score, a high frequency score, and a high monetary score could be classified as “loyal customers.” These customers are likely to be the most valuable customers for the company and should be targeted with marketing campaigns that are designed to retain them. Customers who have a low recency score, a low frequency score, and a low monetary score could be classified as “lapsed customers.” These customers are not as valuable to the company and may be at risk of churning. Marketing campaigns that are designed to reactivate these customers could be targeted to them.
This dataset contains sales data for a retail store from 2022 to 2023. The data includes the following information:
| No. | Attribute Name | Description | Type of Data |
|---|---|---|---|
| 1 | InvoiveNo | Number of invoice | Text |
| 2 | StockCode | Product code | Text |
| 3 | Description | Description of product | Text |
| 4 | Quantity | Amount purchased | Integer |
| 5 | InvoiceDate | Date of invoice | Text |
| 6 | UnitPrice | Price per unit of product | Real |
| 7 | CustomerID | ID of customer | Text |
| 8 | Country | Country of retail location | Text |
The sample data can be used to get a better understanding of the data and you can view here: retail_data
1)Enabled Python Engine to use package of python follow by path.
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.1
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3")
2)Import libraries that are related in this analysis as
follows:
- NumPy: Blazing-fast arrays and math functions for numerical
computing.
- Pandas: Flexible DataFrames for wrangling, analyzing, and visualizing
tabular data.
- datetime: Accurately representing and manipulating dates, times, and
durations.
- Matplotlib: Diverse plots and visualizations to explore and understand
your data.
- Seaborn: Statistical-focused visualizations built on Matplotlib for
deeper insights.
- sqlite3: Serve interface for interacting with SQLite databases.
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
The first step involves querying the desired data for RFM analysis from the SQLite database. We need to filter the data to include only UK sales with a purchase quantity greater than zero. To perform RFM analysis in the following step, we will export this data to a CSV file.
try:
## Use the connect() method of the connector class with the database name. To establish a connection to SQLite.
conn = sqlite3.connect('/Users/j.nrup/Documents/Data Project/RFM analysis/retail_data.db')
## Use the cursor() method of a connection class to create a cursor object to execute or interact with the database SQLite command/queries from Python.
cursor = conn.cursor()
print("Successfully Connected to SQLite")
## Load the data into a DataFrame
## Focuesd the order in UK and filter cancel order out
sales_uk = pd.read_sql_query("SELECT * FROM retail WHERE Country = 'United Kingdom' AND Quantity > 0", conn)
print("Retail in UK", sales_uk)
## Write the sales_UK dataframe to CSV file
sales_uk.to_csv("Retail_in_UK.csv")
## Use cursor.clsoe() method to close the cursor
cursor.close()
except sqlite3.Error as error:
print("Error while connecting to sqlite", error)
finally:
## Use connection.clsoe() method to close the SQLite connections.
if conn:
conn.close()
print("The SQLite connection is closed")
## Successfully Connected to SQLite
## Retail in UK InvoiceNo StockCode ... CustomerID Country
## 0 536365 85123A ... 17850 United Kingdom
## 1 536365 71053 ... 17850 United Kingdom
## 2 536365 84406B ... 17850 United Kingdom
## 3 536365 84029G ... 17850 United Kingdom
## 4 536365 84029E ... 17850 United Kingdom
## ... ... ... ... ... ...
## 486281 581585 22466 ... 15804 United Kingdom
## 486282 581586 22061 ... 13113 United Kingdom
## 486283 581586 23275 ... 13113 United Kingdom
## 486284 581586 21217 ... 13113 United Kingdom
## 486285 581586 20685 ... 13113 United Kingdom
##
## [486286 rows x 8 columns]
## The SQLite connection is closed
# Define the data types of columns while reading a CSV file.
dtype_mapping = {
'InvoiceNo' : 'string',
'StockNo' : 'string',
'CustomerID' : 'string'
}
# Create Data frame called 'sales_in_uk' by CSV file reading.
sales_in_uk = pd.read_csv('Retail_in_UK.csv', encoding="ISO-8859-1", dtype=dtype_mapping)
# Set the first column as index
sales_in_uk.set_index(sales_in_uk.columns[0], inplace=True)
# Print Top 10 of sales_in_uk
sales_in_uk.head(10)
## InvoiceNo StockCode ... CustomerID Country
## Unnamed: 0 ...
## 0 536365 85123A ... 17850 United Kingdom
## 1 536365 71053 ... 17850 United Kingdom
## 2 536365 84406B ... 17850 United Kingdom
## 3 536365 84029G ... 17850 United Kingdom
## 4 536365 84029E ... 17850 United Kingdom
## 5 536365 22752 ... 17850 United Kingdom
## 6 536365 21730 ... 17850 United Kingdom
## 7 536366 22633 ... 17850 United Kingdom
## 8 536366 22632 ... 17850 United Kingdom
## 9 536367 84879 ... 13047 United Kingdom
##
## [10 rows x 8 columns]
# Check number of row and column
sales_in_uk.shape
## (486286, 8)
# Remove rows that CustomerID are null value.
sales_in_uk.dropna(subset=['CustomerID'], how='all', inplace=True)
# Check number of row and column again
sales_in_uk.shape
## (354345, 8)
# Check data type of InvoiceDate
sales_in_uk['InvoiceDate'].dtype
## dtype('O')
From the result, ‘dtype(’O’) indicates that the data type of a column is ‘object’.
# Convert the InvoiceDate that is an objects to the datetime format and store them in a new column
sales_in_uk['InvoiceDate'] = pd.to_datetime(sales_in_uk['InvoiceDate'])
# Check data type of InvoiceDate again
sales_in_uk['InvoiceDate'].dtype
## dtype('<M8[ns]')
From the result, dtype(‘<M8[ns]’) indicates that the data type of a column is datetime64[ns] which is correctly.
# Focused on YEAR 2023
sales_2023 = sales_in_uk[sales_in_uk['InvoiceDate']> "2022-12-31"]
sales_2023.shape
## (330402, 8)
# Summary data that is prepared completely.
print("Summary of Data preparation")
## Summary of Data preparation
print("Number of Transaction : ",sales_2023['InvoiceNo'].nunique())
## Number of Transaction : 15358
print("Number of Product that is bought : ",sales_2023['StockCode'].nunique())
## Number of Product that is bought : 3575
print("Number of Customer : ",sales_2023['CustomerID'].nunique())
## Number of Customer : 3814
print("Total of Sales Quantity : ",sales_2023['Quantity'].sum())
## Total of Sales Quantity : 4001701
print("Total of Revenue in 2023 : ",round((sales_2023['Quantity']*sales_2023['UnitPrice']).sum(),0))
## Total of Revenue in 2023 : 6809730.0
print("Percentage of CustomersID that are null value : ",round(sales_2023['CustomerID'].isnull().sum()*100 / len(sales_2023),2)," %")
## Percentage of CustomersID that are null value : 0.0 %
# last date available in our dataset
last_date = sales_2023['InvoiceDate'].max()
print("Last date purchasing : ", last_date)
## Last date purchasing : 2023-12-09 12:49:00
We will use the last date : 2023-12-09 is reference.
now = dt.date(2023,12,9)
print(now)
## 2023-12-09
Add a new column named “date” containing the invoice date.
sales_2023['date'] = pd.DatetimeIndex(sales_2023['InvoiceDate']).date
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sales_2023.head(10)
## InvoiceNo StockCode ... Country date
## Unnamed: 0 ...
## 39425 539993 22386 ... United Kingdom 2023-01-04
## 39426 539993 21499 ... United Kingdom 2023-01-04
## 39427 539993 21498 ... United Kingdom 2023-01-04
## 39428 539993 22379 ... United Kingdom 2023-01-04
## 39429 539993 20718 ... United Kingdom 2023-01-04
## 39430 539993 85099B ... United Kingdom 2023-01-04
## 39431 539993 20682 ... United Kingdom 2023-01-04
## 39432 539993 22961 ... United Kingdom 2023-01-04
## 39433 539993 22667 ... United Kingdom 2023-01-04
## 39434 539993 22898 ... United Kingdom 2023-01-04
##
## [10 rows x 9 columns]
Group the data by CustomerID and find the latest date of purchase for each customer.
r_df = sales_2023.groupby(by='CustomerID', as_index=False)['date'].max()
r_df.columns = ['CustomerID','LastPurchaseDate']
r_df.head(10)
## CustomerID LastPurchaseDate
## 0 12346 2023-01-18
## 1 12747 2023-12-07
## 2 12748 2023-12-09
## 3 12749 2023-12-06
## 4 12820 2023-12-06
## 5 12821 2023-05-09
## 6 12822 2023-09-30
## 7 12823 2023-09-26
## 8 12824 2023-10-11
## 9 12826 2023-12-07
Calculate the recency score for each customer by finding the difference between the reference date and their latest purchase date in days.
r_df['Recency'] = r_df['LastPurchaseDate'].apply(lambda x : (now - x).days)
r_df.head(10)
## CustomerID LastPurchaseDate Recency
## 0 12346 2023-01-18 325
## 1 12747 2023-12-07 2
## 2 12748 2023-12-09 0
## 3 12749 2023-12-06 3
## 4 12820 2023-12-06 3
## 5 12821 2023-05-09 214
## 6 12822 2023-09-30 70
## 7 12823 2023-09-26 74
## 8 12824 2023-10-11 59
## 9 12826 2023-12-07 2
Remove the unnecessary “LastPurchaseDate” column.
r_df.drop('LastPurchaseDate',axis=1, inplace=True)
r_df.head(10)
## CustomerID Recency
## 0 12346 325
## 1 12747 2
## 2 12748 0
## 3 12749 3
## 4 12820 3
## 5 12821 214
## 6 12822 70
## 7 12823 74
## 8 12824 59
## 9 12826 2
sales_2023_copy = sales_2023
print(sales_2023_copy.shape)
## (330402, 9)
sales_2023_copy.drop_duplicates(subset=['InvoiceDate','CustomerID'],keep='first',inplace=True)
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
print(sales_2023_copy.shape)
## (15279, 9)
Count the number of invoices associated with each CustomerID to determine their frequency of purchase.
f_df = sales_2023_copy.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
f_df.columns = ['CustomerID','Frequency']
f_df.head(10)
## CustomerID Frequency
## 0 12346 1
## 1 12747 9
## 2 12748 176
## 3 12749 5
## 4 12820 4
## 5 12821 1
## 6 12822 2
## 7 12823 5
## 8 12824 1
## 9 12826 6
# Create column total cost
sales_2023['TotalCost'] = sales_2023['Quantity'] * sales_2023['UnitPrice']
## <string>:2: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Calculate the monetary value for each customer by summing their “Total Cost” across all invoices.
m_df = sales_2023.groupby(by='CustomerID', as_index=False).agg({'TotalCost': 'sum'})
m_df.columns = ['CustomerID', 'Monetary']
m_df.head(10)
## CustomerID Monetary
## 0 12346 77183.60
## 1 12747 563.69
## 2 12748 3385.53
## 3 12749 98.35
## 4 12820 58.20
## 5 12821 19.92
## 6 12822 30.80
## 7 12823 1759.50
## 8 12824 5.04
## 9 12826 105.10
# Recency & Frequency
rf_df = r_df.merge(f_df,on='CustomerID')
# Recency & Frequency & Monetary
rfm_df = rf_df.merge(m_df,on='CustomerID')
rfm_df.head(10)
## CustomerID Recency Frequency Monetary
## 0 12346 325 1 77183.60
## 1 12747 2 9 563.69
## 2 12748 0 176 3385.53
## 3 12749 3 5 98.35
## 4 12820 3 4 58.20
## 5 12821 214 1 19.92
## 6 12822 70 2 30.80
## 7 12823 74 5 1759.50
## 8 12824 59 1 5.04
## 9 12826 2 6 105.10
Set the “CustomerID” as the index of the “rfm” dataframe.
rfm_df.set_index('CustomerID',inplace=True)
rfm_df.head(10)
## Recency Frequency Monetary
## CustomerID
## 12346 325 1 77183.60
## 12747 2 9 563.69
## 12748 0 176 3385.53
## 12749 3 5 98.35
## 12820 3 4 58.20
## 12821 214 1 19.92
## 12822 70 2 30.80
## 12823 74 5 1759.50
## 12824 59 1 5.04
## 12826 2 6 105.10
Verify RFM Table: Ensure the resulting “rfm” table accurately
represents the recency, frequency, and monetary values for each
customer.
- TRUE : Correct - FALSE : Incorrect
Cus_Insp_df = sales_2023[sales_2023['CustomerID']=='12820']
Cus_Insp_df
## InvoiceNo StockCode ... date TotalCost
## Unnamed: 0 ...
## 54523 541283 21977 ... 2023-01-17 13.2
## 322249 568236 23328 ... 2023-09-26 15.0
## 374615 572873 23436 ... 2023-10-26 15.0
## 477157 580973 21098 ... 2023-12-06 15.0
##
## [4 rows x 10 columns]
print('Last purchase of 12820',Cus_Insp_df['date'].max())
## Last purchase of 12820 2023-12-06
print('Recency of 12820: ', rfm_df.loc['12820','Recency'])
## Recency of 12820: 3
(now - dt.date(2023,12,6)).days == 3
## True
## RFM Quartiles
rfm_qua = rfm_df.quantile(q=[0.25,0.5,0.75])
rfm_qua
## Recency Frequency Monetary
## 0.25 17.0 1.0 17.40
## 0.50 47.0 2.0 44.26
## 0.75 128.0 4.0 118.80
# Transform to Dict data type
rfm_qua.to_dict()
## {'Recency': {0.25: 17.0, 0.5: 47.0, 0.75: 128.0}, 'Frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 4.0}, 'Monetary': {0.25: 17.4, 0.5: 44.26, 0.75: 118.80000000000001}}
Creating RFM Segmentation:
This segmentation will leverage two key classes:
- High recency: This is considered negative because it indicates
customers haven’t purchased recently.
- High frequency and monetary value: These are positive factors
signifying active and valuable customers.
# First segmentation : Recency
# Arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)
def RScore(x,p,d):
if x <= d[p][0.25]:
return 4
elif x <= d[p][0.50]:
return 3
elif x <= d[p][0.75]:
return 2
else:
return 1
# Second segmentation : Frequency & Monetary
# Arguments (x = value, p = recency, monetary_value, frequency, k = quartiles dict)
def FMScore(x,p,d):
if x <= d[p][0.25]:
return 1
elif x <= d[p][0.50]:
return 2
elif x <= d[p][0.75]:
return 3
else:
return 4
Create rfm segmentation table.
#create rfm segmentation table
rfm_segmentation = rfm_df
rfm_segmentation['R'] = rfm_segmentation['Recency'].apply(RScore, args=('Recency',rfm_qua))
rfm_segmentation['F'] = rfm_segmentation['Frequency'].apply(FMScore, args=('Frequency',rfm_qua))
rfm_segmentation['M'] = rfm_segmentation['Monetary'].apply(FMScore, args=('Monetary',rfm_qua))
rfm_segmentation.head(10)
## Recency Frequency Monetary R F M
## CustomerID
## 12346 325 1 77183.60 1 1 4
## 12747 2 9 563.69 4 4 4
## 12748 0 176 3385.53 4 4 4
## 12749 3 5 98.35 4 4 3
## 12820 3 4 58.20 4 3 3
## 12821 214 1 19.92 1 1 2
## 12822 70 2 30.80 2 2 2
## 12823 74 5 1759.50 2 4 4
## 12824 59 1 5.04 2 1 1
## 12826 2 6 105.10 4 4 3
With individual customer scores in hand, we’re ready to represent our
customer segmentation.
Our first step is to combine the R, F, and M scores into a single
representation.
rfm_segmentation['RFM_Score'] = rfm_segmentation.R.map(str) \
+ rfm_segmentation.F.map(str) \
+ rfm_segmentation.M.map(str)
rfm_segmentation.head(10)
## Recency Frequency Monetary R F M RFM_Score
## CustomerID
## 12346 325 1 77183.60 1 1 4 114
## 12747 2 9 563.69 4 4 4 444
## 12748 0 176 3385.53 4 4 4 444
## 12749 3 5 98.35 4 4 3 443
## 12820 3 4 58.20 4 3 3 433
## 12821 214 1 19.92 1 1 2 112
## 12822 70 2 30.80 2 2 2 222
## 12823 74 5 1759.50 2 4 4 244
## 12824 59 1 5.04 2 1 1 211
## 12826 2 6 105.10 4 4 3 443
Our RFM analysis assigns the highest score of 4 to customers
with:
- Highest Recency: Most recent purchase.
- Highest Frequency: Largest number of purchases.
- Highest Monetary: Highest total spending.
Let’s identify these “Champion” customers, representing our most
valuable segment.
rfm_segmentation[rfm_segmentation['RFM_Score']=='444'].sort_values('Monetary', ascending=False).head(10)
## Recency Frequency Monetary R F M RFM_Score
## CustomerID
## 18102 0 52 36181.65 4 4 4 444
## 17949 1 44 28685.69 4 4 4 444
## 17450 8 44 25953.51 4 4 4 444
## 16333 7 21 14113.68 4 4 4 444
## 16013 3 39 13981.48 4 4 4 444
## 15769 7 24 11390.84 4 4 4 444
## 12901 8 28 9230.45 4 4 4 444
## 13798 1 55 7617.26 4 4 4 444
## 17857 4 21 7384.68 4 4 4 444
## 13694 3 48 7265.06 4 4 4 444
This section provides suggested customer segments for further
analysis. Note that the linked source uses an opposite scoring system,
where 1 is the highest/best score and 4 is the lowest.
Before proceeding, let’s determine the number of customers belonging to
each segment based on our scoring system.
print("Best Customers: ",len(rfm_segmentation[rfm_segmentation['RFM_Score']=='444']))
## Best Customers: 374
print('Loyal Customers: ',len(rfm_segmentation[rfm_segmentation['F']==4]))
## Loyal Customers: 928
print("Big Spenders: ",len(rfm_segmentation[rfm_segmentation['M']==4]))
## Big Spenders: 953
print('Almost Lost: ', len(rfm_segmentation[rfm_segmentation['RFM_Score']=='244']))
## Almost Lost: 85
print('Lost Customers: ',len(rfm_segmentation[rfm_segmentation['RFM_Score']=='144']))
## Lost Customers: 10
print('Lost Cheap Customers: ',len(rfm_segmentation[rfm_segmentation['RFM_Score']=='111']))
## Lost Cheap Customers: 352
2.1) Define cluster code.
rfm_segment = rfm_segmentation.copy()
# Create function for define cluster code
def assign_clusterCode(row):
if row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 3.4 and row['R'] <= 4:
return 1
elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 2.2 and row['R'] <= 3.4:
return 2
elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] >= 1 and row['R'] <= 2.2:
return 3
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.8 and row['R'] <= 4:
return 4
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.2 and row['R'] <= 2.8:
return 5
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] >= 1 and row['R'] <= 2.2:
return 6
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 3.4 and row['R'] <= 4:
return 7
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.8 and row['R'] <= 3.4:
return 8
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.2 and row['R'] <= 2.8:
return 9
else:
return 10
# Apply function to each row to create 'Cluster' column
rfm_segment['ClusterCode'] = rfm_segment.apply(assign_clusterCode, axis=1)
rfm_segment.head(10)
## Recency Frequency Monetary R F M RFM_Score ClusterCode
## CustomerID
## 12346 325 1 77183.60 1 1 4 114 10
## 12747 2 9 563.69 4 4 4 444 1
## 12748 0 176 3385.53 4 4 4 444 1
## 12749 3 5 98.35 4 4 3 443 1
## 12820 3 4 58.20 4 3 3 433 4
## 12821 214 1 19.92 1 1 2 112 10
## 12822 70 2 30.80 2 2 2 222 10
## 12823 74 5 1759.50 2 4 4 244 3
## 12824 59 1 5.04 2 1 1 211 10
## 12826 2 6 105.10 4 4 3 443 1
2.2) Define cluster name.
# Create function for define cluster name 10
def assign_clusterName(row):
if row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 3.4 and row['R'] <= 4:
return 'Champion'
elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] > 2.2 and row['R'] <= 3.4:
return 'Loyal'
elif row['F'] * row['M'] > 10 and row['F'] * row['M'] <= 16 and row['R'] >= 1 and row['R'] <= 2.2:
return 'Cannot Lose them'
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.8 and row['R'] <= 4:
return 'Potential Loyalist'
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] > 2.2 and row['R'] <= 2.8:
return 'Need Attention'
elif row['F'] * row['M'] > 6 and row['F'] * row['M'] <= 10 and row['R'] >= 1 and row['R'] <= 2.2:
return 'At risk'
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 3.4 and row['R'] <= 4:
return 'New Customer'
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.8 and row['R'] <= 3.4:
return 'Promising'
elif row['F'] * row['M'] >= 1 and row['F'] * row['M'] <= 6 and row['R'] > 2.2 and row['R'] <= 2.8:
return 'About to Sleep'
else:
return 'Hibernating'
# Apply function to each row to create 'Cluster' column
rfm_segment['ClusterName'] = rfm_segment.apply(assign_clusterName, axis=1)
rfm_segment.head(10)
## Recency Frequency ... ClusterCode ClusterName
## CustomerID ...
## 12346 325 1 ... 10 Hibernating
## 12747 2 9 ... 1 Champion
## 12748 0 176 ... 1 Champion
## 12749 3 5 ... 1 Champion
## 12820 3 4 ... 4 Potential Loyalist
## 12821 214 1 ... 10 Hibernating
## 12822 70 2 ... 10 Hibernating
## 12823 74 5 ... 3 Cannot Lose them
## 12824 59 1 ... 10 Hibernating
## 12826 2 6 ... 1 Champion
##
## [10 rows x 9 columns]
2.3) Create FxM attribute.
# Create column FxM
rfm_segment['FxM'] = rfm_segment['F'] * rfm_segment['M']
rfm_segment.head(10)
## Recency Frequency Monetary ... ClusterCode ClusterName FxM
## CustomerID ...
## 12346 325 1 77183.60 ... 10 Hibernating 4
## 12747 2 9 563.69 ... 1 Champion 16
## 12748 0 176 3385.53 ... 1 Champion 16
## 12749 3 5 98.35 ... 1 Champion 12
## 12820 3 4 58.20 ... 4 Potential Loyalist 9
## 12821 214 1 19.92 ... 10 Hibernating 2
## 12822 70 2 30.80 ... 10 Hibernating 4
## 12823 74 5 1759.50 ... 3 Cannot Lose them 16
## 12824 59 1 5.04 ... 10 Hibernating 1
## 12826 2 6 105.10 ... 1 Champion 12
##
## [10 rows x 10 columns]
2.4 ) Create customer segmentation by scatter plot.
# Define customer segments
segments = {
1: "Champion",
2: "Loyal",
3: "Cannot loss them",
4: "Potential Loyalist",
5: "Need Attention",
6: "At risk",
7: "New Customer",
8: "Promising",
9: "About to Sleep",
10: "Hibernating"
}
# Define colors for each segment
colors = {
"Champion" : "gold",
"Loyal" : "silver",
"Cannot loss them" : "orange",
"Potential Loyalist" : "green",
"Need Attention" : "blue",
"At risk" : "red",
"New Customer" : "pink",
"Promising" : "brown",
"About to Sleep" : "gray",
"Hibernating" : "black"
}
# Plot scatter plot with legend
plt.figure(figsize=(10, 6))
for segment, group in rfm_segment.groupby('ClusterCode'):
label = segments[segment]
color = colors[label]
plt.scatter(group['R'], group['FxM'], label=label, color=color)
# Annotate each point with CustomerID
for customer_id, r, f in zip(group.index, group['R'], group['FxM']):
plt.annotate(customer_id, (r, f), fontsize=8)
plt.xlabel('Recency')
plt.ylabel('Frequency X Monetary')
plt.title('Customer Segmentation')
plt.legend(loc='right', bbox_to_anchor=(1.4, 0.5), ncol=1)
plt.show()
## CSV file exporting to create report by BI tool.
rfm_segment.to_csv('Cluster_data.csv')
Create a sales analysis report and perform customer segmentation using RFM Analysis. Use the BI tool Tableau Public to display the results as follows:
You can visit the report here: Sales RFM Analysis Report
RFM analysis is a powerful tool that can be used to segment customers and target marketing campaigns. By understanding the different purchasing behaviors of customers, businesses can create marketing campaigns that are more likely to be successful.