Introduction
K-Means, PAM, CLARA
RFM
Dataset and preprocessing
Data Description
Data cleaning
RFM features
Outliers handling
Standardization
Clustering (K-Means, PAM, CLARA)
Assessing Clustering Tendency
Optimal clusters
Clustering
Measuring clustering quality
Visualization
Interpretation and Analysis
What is customer segmentation? Customer segmentation, also known as customer classification, is the division of customers into groups according to their attributes and characteristics, so that a company can provide differentiated services to each group.
My main motivation for choosing this topic is an interest in customer segmentation and unsupervised learning. The customer data comes from a Chinese e-commerce platform (the dataset can be found here): transactions from an online shopping website from 2015 to 2018, with 15 features and a total of 393,482 rows. The purpose is to dig out the consumption characteristics of different customers and classify them effectively for refined operation.
Definition of K-Means: "k-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid)" (more details here: https://en.wikipedia.org/wiki/K-means_clustering).
PAM: the main difference between K-Means and PAM is that K-Means uses centroids (usually artificial points), while PAM uses medoids, which are always actual points in the dataset.
CLARA (Clustering LARge Applications): a clustering method for large datasets. The algorithm is similar to PAM, but it does not work on the entire dataset: it draws several small samples from the data, runs PAM on each sample, and keeps the best resulting clustering as the output.
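To make the difference between the three methods concrete, here is a minimal, self-contained sketch on toy data (the toy matrix and k = 3 are purely illustrative and not part of the project dataset), using kmeans() from base R and pam()/clara() from the cluster package:
library(cluster)
set.seed(1)
# 200 random points in two dimensions, just for illustration
toy=matrix(rnorm(400),ncol=2,dimnames=list(NULL,c('x1','x2')))
km_toy=kmeans(toy,centers=3)        # centroids: artificial points
pam_toy=pam(toy,k=3)                # medoids: actual observations
clara_toy=clara(toy,k=3,samples=5)  # PAM repeated on 5 random samples, best result kept
pam_toy$medoids   # rows taken from the data itself
clara_toy$medoids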
Definition of RFM: "RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns. The system assigns each customer numerical scores based on these factors to provide an objective analysis. RFM analysis is based on the marketing adage that '80% of your business comes from 20% of your customers.'" (more details here: https://www.techtarget.com/searchdatamanagement/definition/RFM-analysis).
RFM analysis ranks each customer on the following factors:
Recency: how recently the customer made a purchase.
Frequency: how often the customer purchases.
Monetary: how much the customer spends.
After importing the data, I find that the full dataset (393,482 rows) is too large: clustering this many observations would place a heavy burden on computing time and memory. Since this project only introduces the clustering methods, I use the sample() function to draw 9,102 rows for the cluster analysis.
data=read.csv('Datasets/consumer_data.csv',sep=',',dec='.')
#Random sampling of 9102 rows of data
set.seed(123)
a=sample(c(1:393482),9102,replace = F)
data1=data[a,]
str(data1)
## 'data.frame': 9102 obs. of 15 variables:
## $ Column1 : int 384527 282209 263445 332633 450932 693898 264290 393072 96961 139873 ...
## $ accountnum : chr "359165c1" "70b4f12a" "affc22d8" "3774e574" ...
## $ time : chr "2016/11/8 17:08" "2016/6/11 16:59" "2016/5/16 12:59" "2016/8/29 12:01" ...
## $ product_code : chr "a30385d4" "70434365" "981d9714" "e7cd084e" ...
## $ sales_volume : int 1 1 1 1 1 1 1 1 1 1 ...
## $ unit_price : int 78 2592 1499 840 1150 350 780 674 2150 206 ...
## $ money : num 78 2592 1499 840 1150 ...
## $ member_point : num 78 2592 1499 840 1150 ...
## $ document_no : chr "d399" "25bb" "6950" "25bb" ...
## $ gender : int 0 0 0 0 0 0 0 0 0 0 ...
## $ register_time: chr "2015/8/26 0:00" "2015/3/22 0:00" "2016/5/16 12:55" "2010/5/3 0:00" ...
## $ quarter : int 4 2 2 3 1 4 2 4 2 2 ...
## $ month : int 11 6 5 8 1 11 5 11 4 5 ...
## $ year : int 2016 2016 2016 2016 2017 2017 2016 2016 2015 2015 ...
## $ day : int 8 11 16 29 10 26 17 24 1 28 ...
head(data1)
## Column1 accountnum time product_code sales_volume unit_price
## 188942 384527 359165c1 2016/11/8 17:08 a30385d4 1 78
## 134058 282209 70b4f12a 2016/6/11 16:59 70434365 1 2592
## 124022 263445 affc22d8 2016/5/16 12:59 981d9714 1 1499
## 160997 332633 3774e574 2016/8/29 12:01 e7cd084e 1 840
## 226318 450932 174617ee 2017/1/10 18:46 9c64cfd6 1 1150
## 365209 693898 880a4c89 2017/11/26 14:42 9da0de8e 1 350
## money member_point document_no gender register_time quarter month year
## 188942 78 78 d399 0 2015/8/26 0:00 4 11 2016
## 134058 2592 2592 25bb 0 2015/3/22 0:00 2 6 2016
## 124022 1499 1499 6950 0 2016/5/16 12:55 2 5 2016
## 160997 840 840 25bb 0 2010/5/3 0:00 3 8 2016
## 226318 1150 1150 dad2 0 2012/10/15 0:00 1 1 2017
## 365209 315 315 dad2 0 2017/10/11 0:00 4 11 2017
## day
## 188942 8
## 134058 11
## 124022 16
## 160997 29
## 226318 10
## 365209 26
data1=data1[,-1]
data1[!complete.cases(data1),]
## [1] accountnum time product_code sales_volume unit_price
## [6] money member_point document_no gender register_time
## [11] quarter month year day
## <0 rows> (or 0-length row.names)
There is no missing data.
data1[duplicated(data1),]
## [1] accountnum time product_code sales_volume unit_price
## [6] money member_point document_no gender register_time
## [11] quarter month year day
## <0 rows> (or 0-length row.names)
There are no duplicated rows.
Some clustering algorithms, such as K-Means, are easily disturbed by outliers, so it is necessary to find and remove them. First, I extract the features needed for the RFM method; then I use dot charts to check for outliers and remove the outlying samples.
Feature extraction obtains the variables used for clustering, removes the interference of irrelevant features, and improves the accuracy of the cluster analysis.
consume_interval: the time between a purchase and the reference date. Since data$time looks like "2016/11/8 17:08", I use the as.POSIXct() function to parse the date and time; and instead of Sys.Date(), I take the latest date in the data (max()) as the reference, because the data was collected a long time ago.
register_interval: the time between registration and the present date.
data1$consume_interval=as.Date(max(as.POSIXct(data1$time)))-as.Date(as.POSIXct(data1$time))
data1$register_interval=Sys.Date()-as.Date(as.POSIXct(data1$register_time))
#then,extract features we want
data1=data1[,c('accountnum','sales_volume','money','consume_interval','register_interval')]
#Convert the data of the newly created time mode to numeric
data1$consume_interval=as.numeric(data1$consume_interval)
data1$register_interval=as.numeric(data1$register_interval)
str(data1)
## 'data.frame': 9102 obs. of 5 variables:
## $ accountnum : chr "359165c1" "70b4f12a" "affc22d8" "3774e574" ...
## $ sales_volume : int 1 1 1 1 1 1 1 1 1 1 ...
## $ money : num 78 2592 1499 840 1150 ...
## $ consume_interval : num 421 571 597 492 358 ...
## $ register_interval: num 2709 2866 2444 4650 3754 ...
Feature transformation uses the original features to generate new features that are more appropriate for clustering, so I now create the following features:
Recency: the number of days since the customer's most recent purchase.
Frequency: the number of purchases made by the customer.
Monetary: the total amount of money spent.
Log: the number of days since registration, which indicates whether a customer is new or old; in addition to the RFM features, I also add this one.
# dplyr::summarise used for statistical description of grouped data and use min() function to get consumer's latest purchase time
recency<-data1 %>% group_by(accountnum) %>% dplyr::summarise(min(consume_interval))
#Calculate how often shoppers shopped during this time
frequency=data1%>%group_by(accountnum)%>%dplyr::summarise(n=n())
monetary=data1%>%group_by(accountnum)%>%dplyr::summarise(sum(money))
log=data1%>%group_by(accountnum)%>%dplyr::summarise(max(register_interval))
#all the data we are interested in by column
data1=bind_cols(recency,frequency[,-1],monetary[,-1],log[,-1])
data1
## # A tibble: 6,428 × 5
## accountnum `min(consume_interval)` n `sum(money)` max(register_interval…¹
## <chr> <dbl> <int> <dbl> <dbl>
## 1 000339f1 18 1 288 3417
## 2 000cd735 220 2 8460 3524
## 3 000e0c35 266 1 420 2317
## 4 0015a268 393 1 620 2240
## 5 0016caca 1069 1 2004 5122
## 6 001772f3 220 1 1050 5070
## 7 001e604d 920 1 1798 3527
## 8 0023bd95 1071 1 7984 2918
## 9 004452c7 418 1 6167 2407
## 10 0047a1e2 886 1 880 3568
## # … with 6,418 more rows, and abbreviated variable name
## # ¹`max(register_interval)`
Rename the columns.
colnames(data1)=c('id','recency','frequency','monetary','log')
head(data1)
## # A tibble: 6 × 5
## id recency frequency monetary log
## <chr> <dbl> <int> <dbl> <dbl>
## 1 000339f1 18 1 288 3417
## 2 000cd735 220 2 8460 3524
## 3 000e0c35 266 1 420 2317
## 4 0015a268 393 1 620 2240
## 5 0016caca 1069 1 2004 5122
## 6 001772f3 220 1 1050 5070
Next, I use dotchart() to look for outliers in each feature.
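The dot-chart figures themselves are not reproduced here; a minimal sketch of how they can be drawn (the 2x2 panel layout is just one possible arrangement):
# one dot chart per feature; outliers appear as isolated points far from the main cloud
par(mfrow=c(2,2))
dotchart(data1$recency,main='recency')
dotchart(data1$frequency,main='frequency')
dotchart(data1$monetary,main='monetary')
dotchart(data1$log,main='log')
par(mfrow=c(1,1))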
From the figures, two features (frequency and monetary) show outliers: sample points with frequency greater than 10 or monetary greater than 50,000 are clearly outliers and need to be removed.
data1=data1[data1$frequency<=10&data1$monetary<=50000,]
Now there are no extreme outliers, so the cluster analysis can be carried out on the data above.
Feature standardization adjusts all variables to a common scale, which improves the performance of clustering algorithms. In our dataset, for example, monetary can have more than four digits while frequency has only one or two; because these variables are on such different scales, they are hard to compare directly. I therefore use the scale() function to standardize the data.
data_new=scale(data1[,2:5],center = T, scale = T)
head(data_new)
## recency frequency monetary log
## [1,] -1.2587380 -0.4328713 -0.522877618 0.1622841
## [2,] -0.6715140 0.6454421 2.011975512 0.2575324
## [3,] -0.5377897 -0.4328713 -0.481932854 -0.8169032
## [4,] -0.1685944 -0.4328713 -0.419895332 -0.8854463
## [5,] 1.7965712 -0.4328713 0.009404317 1.6800245
## [6,] -0.6715140 -0.4328713 -0.286514661 1.6337356
Before clustering, I want to evaluate the clustering tendency of the data. The Hopkins statistic tests for spatial randomness and thus indicates whether the data can be meaningfully clustered: if the value is close to 1 (well above 0.5), the dataset is significantly clusterable.
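The exact call behind the value reported below is not shown; one way to compute the statistic is get_clust_tendency() from factoextra (a sketch, with the subsample size n = 500 chosen arbitrarily):
library(factoextra)
set.seed(123)
# Hopkins statistic computed on a subsample of the standardized data
tendency=get_clust_tendency(data_new,n=500,graph=F)
tendency$hopkins_stat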
The value 0.94 is close to 1, so the customer data is significantly clusterable.
To choose the optimal number of clusters, I use the average silhouette method; WSS (total within-cluster sum of squares, the elbow method) will also be used to see whether the two agree.
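The silhouette and WSS plots summarized below can be produced with fviz_nbclust() from factoextra; a minimal sketch (note that running PAM for every candidate k on several thousand rows is slow):
library(factoextra)
# average silhouette width as a function of k, for each algorithm
fviz_nbclust(data_new,kmeans,method='silhouette')
fviz_nbclust(data_new,cluster::pam,method='silhouette')
fviz_nbclust(data_new,cluster::clara,method='silhouette')
# elbow (WSS) curve; the same call can be repeated with cluster::pam and cluster::clara
fviz_nbclust(data_new,kmeans,method='wss')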
Using the silhouette method for K-Means, PAM and CLARA, the optimal number is 4 clusters.
Using WSS, the results for K-Means, PAM and CLARA are similar, and at 4 clusters there is a substantial decrease (an elbow), hence we can again choose 4 clusters.
Comparing K-Means, PAM, and CLARA
km=eclust(data_new,'kmeans', hc_metric = 'euclidean',k=4,graph = F)
pam=eclust(data_new,'pam', hc_metric = 'euclidean',k=4,graph = F)
clara=eclust(data_new,'clara', hc_metric = 'euclidean',k=4,graph = F)
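The cluster plots compared below can be drawn with fviz_cluster() from factoextra (a sketch; the plots project the four standardized features onto the first two principal components):
library(factoextra)
fviz_cluster(km,data=data_new,main='K-Means, k = 4')
fviz_cluster(pam,data=data_new,main='PAM, k = 4')
fviz_cluster(clara,data=data_new,main='CLARA, k = 4')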
Comparing the plots, we can see that the CLARA clustering results differ from PAM and K-Means in group 1. For a more in-depth comparison, I move on to the next step.
Since the CLARA plot with 4 clusters shows significant overlap, I will use the silhouette and Calinski-Harabasz statistics to check the clustering quality.
km$silinfo$clus.avg.widths
## [1] 0.2665761 0.4368819 0.1056522 0.4343339
pam$silinfo$clus.avg.widths
## [1] 0.49970787 0.00193485 0.43988465 0.23260873
clara$silinfo$clus.avg.widths
## [1] 0.475685490 -0.003846953 0.379344731 0.291522383
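These per-cluster averages can also be inspected visually with fviz_silhouette() from factoextra; a sketch:
library(factoextra)
# silhouette width of every observation, grouped and averaged by cluster
fviz_silhouette(km)
fviz_silhouette(pam)
fviz_silhouette(clara)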
The silhouette values show that one of the CLARA clusters (with k = 4) has a negative average silhouette width, which means this clustering is not good.
From these figures, the silhouette analysis shows that PAM and CLARA produce similar clustering results, while K-Means differs from PAM. So in the next step, we will use the Calinski-Harabasz index for a deeper look.
The Calinski-Harabasz index is a very popular measure of clustering quality, especially for K-Means: the higher the statistic, the better. It is usually used to compare solutions with different numbers of clusters.
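For reference, the index compares between-cluster to within-cluster dispersion (this is the standard definition, not stated explicitly in the original text):

$$CH(k)=\frac{B(k)/(k-1)}{W(k)/(n-k)}$$

where $B(k)$ is the between-cluster sum of squares, $W(k)$ the total within-cluster sum of squares, $n$ the number of observations and $k$ the number of clusters.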
library(fpc) # calinhara() comes from the fpc package
round(calinhara(data_new,km$cluster),2)
## [1] 2721.5
round(calinhara(data_new,pam$cluster),2)
## [1] 2559.17
round(calinhara(data_new,clara$cluster),2)
## [1] 2359.23
The Calinski-Harabasz index shows that K-Means with k = 4 performs best. So next, we take a closer look at the clusters from the K-Means clustering.
ceters=km$centers #clustering center location
ceters
## recency frequency monetary log
## 1 -0.2697121 -0.1355452 -0.1426494 1.3923002
## 2 1.4654730 -0.2974777 -0.1990241 0.1816523
## 3 -0.6437804 2.5541269 2.2690184 -0.1394450
## 4 -0.5325828 -0.1738509 -0.1780270 -0.6763795
Restore the cluster center coordinates to the original (unstandardized) scale.
ceters[,1]=ceters[,1]*sd(data1$recency)+mean(data1$recency)
ceters[,2]=ceters[,2]*sd(data1$frequency)+mean(data1$frequency)
ceters[,3]=ceters[,3]*sd(data1$monetary)+mean(data1$monetary)
ceters[,4]=ceters[,4]*sd(data1$log)+mean(data1$log)
ceters
## recency frequency monetary log
## 1 358.2164 1.275733 1513.801 4798.776
## 2 955.1051 1.125561 1332.057 3438.758
## 3 229.5401 3.770065 9288.669 3078.043
## 4 267.7911 1.240209 1399.749 2474.862
Draw a radar chart so we can see the differences between the cluster centers.
# prepare for the radar data, which should contain the max and min value
ceters=data.frame(ceters)
min=c(220,1.1,1300,2400)
max=c(960,3.8,9300,4800)
radar=rbind(max,min,ceters)
library(fmsb)
par(pin=c(3.8,3.8))
radarchart(
radar, axistype = 1,
# Customize the polygon
pcol = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'), pfcol = scales::alpha(c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),0.4), plwd = 3, plty = 1,
# Customize the grid
cglcol = "grey", cglty = 1, cglwd = 0.5,
# Customize the axis
axislabcol = "grey",
# Variable labels
vlcex = 0.8, vlabels = colnames(radar),
caxislabels = c(0, 5, 10, 15, 20))
# Add a horizontal legend
legend(
x = "topright",
legend = c('group1','group2','group3','group4'),
horiz = TRUE,bty = "n", pch = 20 ,
col = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),
text.col = "black", cex = 1, pt.cex = 1.5
)
group 1: the customers who registered earliest; they are the early customers of the mall and still spend from time to time, but their frequency and monetary values are low. Since registration, neither the number of purchases nor the amount spent has been large.
group 2: the time since the last purchase is short; although frequency and monetary are low, these customers have been relatively active and can be motivated with promotional tools.
group 3: the most valuable customers; their cumulative spending is the largest, but lately they have not been very active. They are high-value customers and need to be maintained.
group 4: the most recently registered customers, but with the weakest consumption behavior. They are likely churned customers and need to be re-activated.
#clustering medoids location
ceters_pam=pam$medoids
ceters_pam
## recency frequency monetary log
## [1,] -0.5203474 -0.4328713 -0.3423484 -0.783966892
## [2,] -0.6860492 1.7237555 1.0448106 -0.456384222
## [3,] 1.4680746 -0.4328713 -0.3640616 -0.006848221
## [4,] -0.2441777 -0.4328713 -0.2379083 1.115656525
#Cluster medoids coordinate restoration
ceters_pam[,1]=ceters_pam[,1]*sd(data1$recency)+mean(data1$recency)
ceters_pam[,2]=ceters_pam[,2]*sd(data1$frequency)+mean(data1$frequency)
ceters_pam[,3]=ceters_pam[,3]*sd(data1$monetary)+mean(data1$monetary)
ceters_pam[,4]=ceters_pam[,4]*sd(data1$log)+mean(data1$log)
ceters_pam
## recency frequency monetary log
## [1,] 272 1 870.0 2354
## [2,] 215 3 5342.0 2722
## [3,] 956 1 800.0 3227
## [4,] 367 1 1206.7 4488
# prepare for the radar data, which should contain the max and min value
ceters_pam=data.frame(ceters_pam)
min_pam=c(210,0,700,2300)
max_pam=c(1000,4,5400,4500)
radar_pam=rbind(max_pam,min_pam,ceters_pam)
#draw the radar plot
par(pin=c(3.8,3.8))
radarchart(
radar_pam, axistype = 1,
# Customize the polygon
pcol = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'), pfcol = scales::alpha(c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),0.4), plwd = 3, plty = 1,
# Customize the grid
cglcol = "grey", cglty = 1, cglwd = 0.5,
# Customize the axis
axislabcol = "grey",
# Variable labels
vlcex = 0.8, vlabels = colnames(radar_pam),
caxislabels = c(0, 5, 10, 15, 20))
# Add a horizontal legend
legend(
x = "topright",
legend = c('group1','group2','group3','group4'),
horiz = TRUE,bty = "n", pch = 20 ,
col = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),
text.col = "black", cex = 1, pt.cex = 1.5
)
Comparing the radar charts of K-Means and PAM, the grouping results are similar. Therefore, either classification can be used.
First, attach the classification results to the customer data.
data1$segmentation=km$cluster
head(data1,10)
## # A tibble: 10 × 6
## id recency frequency monetary log segmentation
## <chr> <dbl> <int> <dbl> <dbl> <int>
## 1 000339f1 18 1 288 3417 4
## 2 000cd735 220 2 8460 3524 3
## 3 000e0c35 266 1 420 2317 4
## 4 0015a268 393 1 620 2240 4
## 5 0016caca 1069 1 2004 5122 2
## 6 001772f3 220 1 1050 5070 1
## 7 001e604d 920 1 1798 3527 2
## 8 0023bd95 1071 1 7984 2918 2
## 9 004452c7 418 1 6167 2407 4
## 10 0047a1e2 886 1 880 3568 2
The proportion of customers in the different categories.
hist(data1$segmentation,col ='orange',axes=F)
axis(1,c(1,2,3,4),c('old customer','active customer','VIP customer','lost customer'))