1. Introduction

    • K-Means, PAM, CLARA

    • RFM

  2. Dataset and preprocessing

    • Data Description

    • Data cleaning

    • RFM features

    • Outlier removal

    • Feature standardization

  3. Clustering (K-Means, PAM, CLARA)

    • Assessing Clustering Tendency

    • Optimal clusters

    • Clustering

    • Measuring clustering quality

    • Visualization

  4. Interpretation and Analysis

1 Introduction

What is customer segmentation? Customer segmentation, also known as customer classification, is the division of customers into several groups according to their attributes and characteristics, so that an enterprise can offer each group differentiated services.

My main motivation for choosing this topic is an interest in customer segmentation and unsupervised learning. The customer data comes from a Chinese e-commerce platform (the dataset can be found here): it records purchases on an online shopping website from 2015 to 2018, with 15 features and a total of 393,482 rows. The goal is to uncover the consumption characteristics of different customers and classify them effectively for more refined operations.

1.1 K-Means, PAM, CLARA

The definition of K-Means is: “k-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid)” (more introduction here: https://en.wikipedia.org/wiki/K-means_clustering).

PAM: the main difference between K-Means and PAM is that K-Means uses centroids (usually artificial points), while PAM uses medoids, which are always actual points in the dataset.

CLARA (Clustering LARge Applications): this algorithm is similar to PAM, but it does not consider the entire dataset. Instead, it draws several small samples from the data, runs PAM on each sample, and keeps the best resulting clustering as the output.
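To make the difference concrete, here is a minimal sketch (not part of the analysis itself) of how the three algorithms are called in R, on a toy matrix and assuming the cluster package:

# toy illustration only: base kmeans() vs. cluster::pam() vs. cluster::clara()
library(cluster)
set.seed(123)
x=matrix(rnorm(200),ncol=2)        # 100 random 2-d points
km_toy=kmeans(x,centers=3)         # centers are artificial mean points
pam_toy=pam(x,k=3)                 # medoids are actual observations
clara_toy=clara(x,k=3,samples=5)   # PAM run on several subsamples, best result kept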

2 Dataset and preprocessing

2.1 Data Description

After importing the data, I find that the amount of data (393,482 rows) is quite large; clustering all of it would be demanding in both computing power and time. Since this project only illustrates the clustering methods, I use the sample() function to draw 9,102 rows for the cluster analysis.

data=read.csv('Datasets/consumer_data.csv',sep=',',dec='.')
#Random sampling of 9102 rows of data
set.seed(123)
a=sample(c(1:393482),9102,replace = F)
data1=data[a,]
str(data1)
## 'data.frame':    9102 obs. of  15 variables:
##  $ Column1      : int  384527 282209 263445 332633 450932 693898 264290 393072 96961 139873 ...
##  $ accountnum   : chr  "359165c1" "70b4f12a" "affc22d8" "3774e574" ...
##  $ time         : chr  "2016/11/8 17:08" "2016/6/11 16:59" "2016/5/16 12:59" "2016/8/29 12:01" ...
##  $ product_code : chr  "a30385d4" "70434365" "981d9714" "e7cd084e" ...
##  $ sales_volume : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ unit_price   : int  78 2592 1499 840 1150 350 780 674 2150 206 ...
##  $ money        : num  78 2592 1499 840 1150 ...
##  $ member_point : num  78 2592 1499 840 1150 ...
##  $ document_no  : chr  "d399" "25bb" "6950" "25bb" ...
##  $ gender       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ register_time: chr  "2015/8/26 0:00" "2015/3/22 0:00" "2016/5/16 12:55" "2010/5/3 0:00" ...
##  $ quarter      : int  4 2 2 3 1 4 2 4 2 2 ...
##  $ month        : int  11 6 5 8 1 11 5 11 4 5 ...
##  $ year         : int  2016 2016 2016 2016 2017 2017 2016 2016 2015 2015 ...
##  $ day          : int  8 11 16 29 10 26 17 24 1 28 ...
head(data1)
##        Column1 accountnum             time product_code sales_volume unit_price
## 188942  384527   359165c1  2016/11/8 17:08     a30385d4            1         78
## 134058  282209   70b4f12a  2016/6/11 16:59     70434365            1       2592
## 124022  263445   affc22d8  2016/5/16 12:59     981d9714            1       1499
## 160997  332633   3774e574  2016/8/29 12:01     e7cd084e            1        840
## 226318  450932   174617ee  2017/1/10 18:46     9c64cfd6            1       1150
## 365209  693898   880a4c89 2017/11/26 14:42     9da0de8e            1        350
##        money member_point document_no gender   register_time quarter month year
## 188942    78           78        d399      0  2015/8/26 0:00       4    11 2016
## 134058  2592         2592        25bb      0  2015/3/22 0:00       2     6 2016
## 124022  1499         1499        6950      0 2016/5/16 12:55       2     5 2016
## 160997   840          840        25bb      0   2010/5/3 0:00       3     8 2016
## 226318  1150         1150        dad2      0 2012/10/15 0:00       1     1 2017
## 365209   315          315        dad2      0 2017/10/11 0:00       4    11 2017
##        day
## 188942   8
## 134058  11
## 124022  16
## 160997  29
## 226318  10
## 365209  26

2.2 Data Cleaning

2.2.1 Looking for null or missing values

data1=data1[,-1]
data1[!complete.cases(data1),]
##  [1] accountnum    time          product_code  sales_volume  unit_price   
##  [6] money         member_point  document_no   gender        register_time
## [11] quarter       month         year          day          
## <0 rows> (or 0-length row.names)

There is no missing data.

2.2.2 Looking for duplicated values

data1[duplicated(data1),]
##  [1] accountnum    time          product_code  sales_volume  unit_price   
##  [6] money         member_point  document_no   gender        register_time
## [11] quarter       month         year          day          
## <0 rows> (or 0-length row.names)

There are no duplicated values.

2.3 RFM Features

Some clustering algorithms such as K-Means are easily disturbed by outliers, so it is necessary to find and eliminate them. First, I derive the features needed for the RFM method; then I use dot charts to check for outliers and remove the outlying samples.

2.3.1 Feature extraction and transformation

Feature extraction obtains the variables used for clustering, removes the interference of irrelevant features, and improves the accuracy of the cluster analysis.

  • consume_interval: the time interval between a consumption event and the present. Since data$time looks like “2016/11/8 17:08”, I use the as.POSIXct() function to parse the date and time, and instead of Sys.Date() I take the max() date in the data as “the present”, because the data was collected some time ago.

  • register_interval: the time interval between the registration date and the present time.

data1$consume_interval=as.Date(max(as.POSIXct(data1$time)))-as.Date(as.POSIXct(data1$time))
data1$register_interval=Sys.Date()-as.Date(as.POSIXct(data1$register_time))
#then,extract features we want
data1=data1[,c('accountnum','sales_volume','money','consume_interval','register_interval')]
#Convert the data of the newly created time mode to numeric
data1$consume_interval=as.numeric(data1$consume_interval)
data1$register_interval=as.numeric(data1$register_interval)
str(data1)
## 'data.frame':    9102 obs. of  5 variables:
##  $ accountnum       : chr  "359165c1" "70b4f12a" "affc22d8" "3774e574" ...
##  $ sales_volume     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ money            : num  78 2592 1499 840 1150 ...
##  $ consume_interval : num  421 571 597 492 358 ...
##  $ register_interval: num  2709 2866 2444 4650 3754 ...

2.3.2 Recency, frequency, and monetary (RFM)

Feature transformation generates, from the original features, new features that are appropriate for clustering. I will now create the following features:

  • Recency: according to the last consumption time from today.

  • Frequency: count for the number of consumption.

  • Monetary: the sum of money.

  • Log: the time since registration, which indicates whether a consumer is a new or an old customer; in addition to the three RFM features, I also add this one.

# load dplyr for the pipe, group_by() and summarise()
library(dplyr)
# summarise() describes the grouped data; min() gives each consumer's most recent purchase interval
recency<-data1 %>% group_by(accountnum) %>% dplyr::summarise(min(consume_interval))
#Calculate how often shoppers shopped during this time
frequency=data1%>%group_by(accountnum)%>%dplyr::summarise(n=n())
monetary=data1%>%group_by(accountnum)%>%dplyr::summarise(sum(money))
log=data1%>%group_by(accountnum)%>%dplyr::summarise(max(register_interval))
#all the data we are interested in by column
data1=bind_cols(recency,frequency[,-1],monetary[,-1],log[,-1])
data1
## # A tibble: 6,428 × 5
##    accountnum `min(consume_interval)`     n `sum(money)` max(register_interval…¹
##    <chr>                        <dbl> <int>        <dbl>                   <dbl>
##  1 000339f1                        18     1          288                    3417
##  2 000cd735                       220     2         8460                    3524
##  3 000e0c35                       266     1          420                    2317
##  4 0015a268                       393     1          620                    2240
##  5 0016caca                      1069     1         2004                    5122
##  6 001772f3                       220     1         1050                    5070
##  7 001e604d                       920     1         1798                    3527
##  8 0023bd95                      1071     1         7984                    2918
##  9 004452c7                       418     1         6167                    2407
## 10 0047a1e2                       886     1          880                    3568
## # … with 6,418 more rows, and abbreviated variable name
## #   ¹​`max(register_interval)`

Rename the columns.

colnames(data1)=c('id','recency','frequency','monetary','log')
head(data1)
## # A tibble: 6 × 5
##   id       recency frequency monetary   log
##   <chr>      <dbl>     <int>    <dbl> <dbl>
## 1 000339f1      18         1      288  3417
## 2 000cd735     220         2     8460  3524
## 3 000e0c35     266         1      420  2317
## 4 0015a268     393         1      620  2240
## 5 0016caca    1069         1     2004  5122
## 6 001772f3     220         1     1050  5070

2.4 Outlier Removal

Using dotchart() to look for outliers:
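The dot charts themselves are not reproduced in this text; a minimal sketch of how they can be drawn, one per feature:

# one dot chart per feature to spot extreme points visually
par(mfrow=c(2,2))
dotchart(data1$recency,  main='recency')
dotchart(data1$frequency,main='frequency')
dotchart(data1$monetary, main='monetary')
dotchart(data1$log,      main='log')
par(mfrow=c(1,1))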

From the figures, two features (frequency and monetary) have outliers: sample points with frequency greater than 10 or monetary greater than 50,000 are clearly outliers and need to be deleted.

data1=data1[data1$frequency<=10&data1$monetary<=50000,]

Now there are no extreme outliers, so the cluster analysis can be done with the data above.

2.5 Feature Standardization

Feature standardization rescales all variables to a common scale, which improves the performance of clustering algorithms. For example, in our dataset monetary can have more than four digits while frequency has only one or two; because these variables are on different scales, they are hard to compare directly. So I use the scale() function to standardize the data.

data_new=scale(data1[,2:5],center = T, scale = T)
head(data_new)
##         recency  frequency     monetary        log
## [1,] -1.2587380 -0.4328713 -0.522877618  0.1622841
## [2,] -0.6715140  0.6454421  2.011975512  0.2575324
## [3,] -0.5377897 -0.4328713 -0.481932854 -0.8169032
## [4,] -0.1685944 -0.4328713 -0.419895332 -0.8854463
## [5,]  1.7965712 -0.4328713  0.009404317  1.6800245
## [6,] -0.6715140 -0.4328713 -0.286514661  1.6337356

3 Clustering (K-Means, PAM, CLARA)

3.1 Assessing Clustering Tendency

Before clustering, I want to evaluate the clustering tendency of the data. The Hopkins statistic tests spatial randomness and thereby judges whether the data can be clustered: if the value is close to 1 (well above 0.5), the dataset can be considered significantly clusterable.
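A minimal sketch of one way to compute the statistic, assuming the factoextra package (the subsample size n = 50 is an arbitrary choice for illustration):

# Hopkins statistic on the standardized features; values near 1 suggest clusterable data
library(factoextra)
set.seed(123)
get_clust_tendency(data_new, n = 50, graph = FALSE)$hopkins_stat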

The value 0.94 is close to 1, so the customer data is significantly clusterable.

3.2 Optimal number of clusters

To choose the optimal number of clusters, I will use the average silhouette width. The WSS (total within-cluster sum of squares) will also be checked to see whether the two criteria differ.

  • Silhouette

Using the silhouette method for K-Means, PAM and CLARA, the optimal number of clusters is 4.

Using WSS, the results for K-Means, PAM and CLARA are similar: at 4 clusters there is a substantial drop (an elbow), so we can again choose 4 clusters.
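Both kinds of curves can be produced with factoextra::fviz_nbclust(); a sketch, shown for K-Means, with cluster::pam and cluster::clara substituted as the clustering function for the other two methods:

# average silhouette width and total WSS as a function of k
library(factoextra)
fviz_nbclust(data_new, kmeans, method = 'silhouette')
fviz_nbclust(data_new, kmeans, method = 'wss')
fviz_nbclust(data_new, cluster::pam, method = 'silhouette')
fviz_nbclust(data_new, cluster::clara, method = 'silhouette')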

3.3 Clustering

Comparing K-Means, PAM, and CLARA

library(factoextra)  # provides eclust()
km=eclust(data_new,'kmeans', hc_metric = 'euclidean',k=4,graph = F)
pam=eclust(data_new,'pam', hc_metric = 'euclidean',k=4,graph = F)
clara=eclust(data_new,'clara', hc_metric = 'euclidean',k=4,graph = F)
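The cluster plots compared below can be drawn from the eclust objects with factoextra::fviz_cluster(), which projects the data onto the first two principal components; a sketch:

# one cluster plot per solution (points only)
fviz_cluster(km,    geom = 'point', main = 'K-Means, k = 4')
fviz_cluster(pam,   geom = 'point', main = 'PAM, k = 4')
fviz_cluster(clara, geom = 'point', main = 'CLARA, k = 4')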

Comparing the plots, we can see that the CLARA clustering differs from PAM and K-Means in group 1. For a more in-depth comparison, I go to the next step.

3.4 Measuring clustering quality

Since the CLARA plot with 4 clusters shows significant overlap, I will use the silhouette and the Calinski-Harabasz index to check the clustering quality.

3.4.1 Silhouette

km$silinfo$clus.avg.widths
## [1] 0.2665761 0.4368819 0.1056522 0.4343339
pam$silinfo$clus.avg.widths
## [1] 0.49970787 0.00193485 0.43988465 0.23260873
clara$silinfo$clus.avg.widths
## [1]  0.475685490 -0.003846953  0.379344731  0.291522383

The average silhouette widths show that CLARA with k = 4 has one cluster with a negative average silhouette value, which means this clustering is not good.
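The silhouette figures referred to below can be drawn directly from the eclust objects; a sketch using factoextra::fviz_silhouette():

# per-observation silhouette widths for each solution
fviz_silhouette(km)
fviz_silhouette(pam)
fviz_silhouette(clara)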

From these figures, the silhouette shows that PAM and CLARA give similar clustering results, while K-Means and PAM do differ. So in the next step we will use the Calinski-Harabasz index to take a deeper look.

3.4.2 Calinski-Harabasz

The Calinski-Harabasz index is a very popular measure of clustering quality for K-Means: it is the ratio of the between-cluster sum of squares (divided by k - 1) to the within-cluster sum of squares (divided by n - k), so higher values are better. It is usually used to compare solutions with different numbers of clusters.

library(fpc)  # provides calinhara()
round(calinhara(data_new,km$cluster),2)
## [1] 2721.5
round(calinhara(data_new,pam$cluster),2)
## [1] 2559.17
round(calinhara(data_new,clara$cluster),2)
## [1] 2359.23

The Calinski-Harabasz index shows that K-Means with k = 4 is the best of the three. So next we take a closer look at the K-Means clusters.

4 Interpretation and Analysis

4.1 Analysis for K-Means results

ceters=km$centers  #clustering center location
ceters
##      recency  frequency   monetary        log
## 1 -0.2697121 -0.1355452 -0.1426494  1.3923002
## 2  1.4654730 -0.2974777 -0.1990241  0.1816523
## 3 -0.6437804  2.5541269  2.2690184 -0.1394450
## 4 -0.5325828 -0.1738509 -0.1780270 -0.6763795

Restore the cluster center coordinates to the original scale:

ceters[,1]=ceters[,1]*sd(data1$recency)+mean(data1$recency)
ceters[,2]=ceters[,2]*sd(data1$frequency)+mean(data1$frequency)
ceters[,3]=ceters[,3]*sd(data1$monetary)+mean(data1$monetary)
ceters[,4]=ceters[,4]*sd(data1$log)+mean(data1$log)
ceters
##    recency frequency monetary      log
## 1 358.2164  1.275733 1513.801 4798.776
## 2 955.1051  1.125561 1332.057 3438.758
## 3 229.5401  3.770065 9288.669 3078.043
## 4 267.7911  1.240209 1399.749 2474.862

Draw a radar chart so we can see the differences between the cluster centers.

# prepare for the radar data, which should contain the max and min value
ceters=data.frame(ceters)
min=c(220,1.1,1300,2400)
max=c(960,3.8,9300,4800)
radar=rbind(max,min,ceters)
library(fmsb) 
par(pin=c(3.8,3.8))
radarchart(
  radar, axistype = 1,
  # Customize the polygon
  pcol = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'), pfcol = scales::alpha(c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),0.4), plwd = 3, plty = 1,
  # Customize the grid
  cglcol = "grey", cglty = 1, cglwd = 0.5,
  # Customize the axis
  axislabcol = "grey", 
  # Variable labels
  vlcex = 0.8, vlabels = colnames(radar),
  caxislabels = c(0, 5, 10, 15, 20))
# Add an horizontal legend
legend(
  x = "topright", 
  legend = c('group1','group2','group3','group4'), 
  horiz = TRUE,bty = "n", pch = 20 , 
  col = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),
  text.col = "black", cex = 1, pt.cex = 1.5
)

  • group1: these customers registered the earliest and are the mall’s early customers; they still spend from time to time, but their frequency and monetary values are low. Since registration, neither their number of purchases nor their spending has been large.

  • group2: the time since their last consumption is the shortest; although their frequency and monetary values are low, their consumption has been relatively active, and they can be motivated with promotional tools.

  • group3: these customers are the most valuable; their cumulative consumption amount is the largest, but lately they have not been very active. They are high-value customers and need to be maintained.

  • group4: these customers registered most recently but show the worst consumption behavior. They are likely to churn and need to be activated.

4.2 Analysis for PAM results

#clustering medoids location
ceters_pam=pam$medoids 
ceters_pam
##         recency  frequency   monetary          log
## [1,] -0.5203474 -0.4328713 -0.3423484 -0.783966892
## [2,] -0.6860492  1.7237555  1.0448106 -0.456384222
## [3,]  1.4680746 -0.4328713 -0.3640616 -0.006848221
## [4,] -0.2441777 -0.4328713 -0.2379083  1.115656525
#Cluster medoids coordinate restoration
ceters_pam[,1]=ceters_pam[,1]*sd(data1$recency)+mean(data1$recency)
ceters_pam[,2]=ceters_pam[,2]*sd(data1$frequency)+mean(data1$frequency)
ceters_pam[,3]=ceters_pam[,3]*sd(data1$monetary)+mean(data1$monetary)
ceters_pam[,4]=ceters_pam[,4]*sd(data1$log)+mean(data1$log)
ceters_pam
##      recency frequency monetary  log
## [1,]     272         1    870.0 2354
## [2,]     215         3   5342.0 2722
## [3,]     956         1    800.0 3227
## [4,]     367         1   1206.7 4488
# prepare for the radar data, which should contain the max and min value
ceters_pam=data.frame(ceters_pam)
min_pam=c(210,0,700,2300)
max_pam=c(1000,4,5400,4500)
radar_pam=rbind(max_pam,min_pam,ceters_pam)
#draw the radar plot
par(pin=c(3.8,3.8))
radarchart(
  radar_pam, axistype = 1,
  # Customize the polygon
  pcol = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'), pfcol = scales::alpha(c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),0.4), plwd = 3, plty = 1,
  # Customize the grid
  cglcol = "grey", cglty = 1, cglwd = 0.5,
  # Customize the axis
  axislabcol = "grey", 
  # Variable labels
  vlcex = 0.8, vlabels = colnames(radar_pam),
  caxislabels = c(0, 5, 10, 15, 20))
# Add an horizontal legend
legend(
  x = "topright", 
  legend = c('group1','group2','group3','group4'), 
  horiz = TRUE,bty = "n", pch = 20 , 
  col = c("#00AFBB", "#E7B800", "#FC4E07",'#e99999'),
  text.col = "black", cex = 1, pt.cex = 1.5
)

Comparing the radar charts of K-Means and PAM, the grouping results are similar. Therefore either classification can be used.

4.3 Customer Segmentation Visualization

First, attach the cluster assignments to the customer data.

data1$segmentation=km$cluster
head(data1,10)
## # A tibble: 10 × 6
##    id       recency frequency monetary   log segmentation
##    <chr>      <dbl>     <int>    <dbl> <dbl>        <int>
##  1 000339f1      18         1      288  3417            4
##  2 000cd735     220         2     8460  3524            3
##  3 000e0c35     266         1      420  2317            4
##  4 0015a268     393         1      620  2240            4
##  5 0016caca    1069         1     2004  5122            2
##  6 001772f3     220         1     1050  5070            1
##  7 001e604d     920         1     1798  3527            2
##  8 0023bd95    1071         1     7984  2918            2
##  9 004452c7     418         1     6167  2407            4
## 10 0047a1e2     886         1      880  3568            2

The proportion of customers in each category:

hist(data1$segmentation,col ='orange',axes=F)
axis(1,c(1,2,3,4),c('old customer','active customer','VIP customer','lost customer'))
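Besides the histogram, the exact shares can be printed; a one-line sketch:

# share of customers in each segment
round(prop.table(table(data1$segmentation)),3)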