1. R Requirements

# Data manipulation
library("dplyr")
library("knitr")
library("tidyr") 

# Cluster
library("kohonen")
library("NbClust")
library("factoextra")

# Visualization
library("ggplot2")
library("car") # associated with the book 'An R Companion to Applied Regression', Third Edition, by John Fox and Sanford Weisberg.
library("rgl")

Key functions in dplyr

The table below presents the most commonly used functions in dplyr; a short pipeline combining several of them follows the table.

| Function | Returns |
|---|---|
| filter(data, conditions) | rows from data where conditions hold |
| select(data, variables) | subset of the columns in data, as specified in variables |
| arrange(data, variables) | data sorted by variables |
| group_by(data, variables) | copy of data, with groups defined by variables |
| summarize(data, newvar = function) | data frame with newvar columns that summarize data (or each group in data) based on an aggregation function |
| mutate(data, newvar = function) | data frame with newvar columns defined by a function of existing columns |
| inner_join(data1, data2, by = variables) | data frame that joins columns from data1 and data2 based on matching values of variables (left_join, full_join, etc. behave analogously) |
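
As a quick illustration, here is a minimal sketch chaining several of these verbs on the invoice data loaded in Section 2 (it assumes dsInvoice is already in the workspace):

require(dplyr)
dsInvoice %>%
  filter(Amount > 0) %>%              # keep rows with positive invoice amounts
  group_by(CustomerId) %>%            # one group per customer
  summarise(total = sum(Amount)) %>%  # aggregate amounts within each group
  arrange(desc(total)) %>%            # sort customers by total spend
  head()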

2. Data load

# read.csv2() expects a semicolon-separated file with a comma as the decimal mark
dsInvoice = read.csv2("D://Datasets//dsRFM.csv")
str(dsInvoice)
#> 'data.frame':    6850 obs. of  3 variables:
#>  $ InvoiceDate: Factor w/ 287 levels "2017-01-02","2017-01-03",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ CustomerId : Factor w/ 612 levels "10J1001","10J1004",..: 560 411 354 558 448 448 110 474 346 109 ...
#>  $ Amount     : num  987 9333 251 8875 208 ...

3. Data manipulation

3.1. Remove non-positive invoice amounts and NAs

require(dplyr)
# Flag zero and negative amounts as NA...
dsInvoice <- dsInvoice %>% mutate(Amount = replace(Amount, Amount <= 0, NA))
require(tidyr)
# ...then drop every row containing an NA
dsInvoice <- dsInvoice %>% drop_na()

3.2. Calculate RFM

dsRFM <- dsInvoice %>% group_by(CustomerId) %>% 
  summarise(recency = as.numeric(Sys.Date() - max(as.Date(InvoiceDate))), # days since last purchase
            frequency = n(),        # number of invoices
            monetary = sum(Amount)) # total amount spent
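
Note that recency is computed from Sys.Date(), so the figures change every day the script is run. For reproducible results a fixed snapshot date can be used instead; a sketch (the date below is a hypothetical choice, not from the original analysis):

snapshot <- as.Date("2017-12-31") # hypothetical fixed analysis date
dsRFM.fixed <- dsInvoice %>% group_by(CustomerId) %>%
  summarise(recency = as.numeric(snapshot - max(as.Date(InvoiceDate))),
            frequency = n(),
            monetary = sum(Amount))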

4. Data exploration and method assumptions

require(knitr)
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary |
|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 |
| 10J1004 | 24 | 7 | 18362.64 |
| 10J1007 | 16 | 44 | 59383.90 |
| 10J1008 | 86 | 3 | 9500.52 |
| 10J1009 | 158 | 1 | 1437.00 |
| 10J1013 | 66 | 2 | 6268.40 |
summary(dsRFM)
#>    CustomerId     recency         frequency        monetary        
#>  10J1001:  1   Min.   : 15.00   Min.   : 1.00   Min.   :    54.96  
#>  10J1004:  1   1st Qu.: 24.00   1st Qu.: 2.00   1st Qu.:  3406.83  
#>  10J1007:  1   Median : 51.00   Median : 6.00   Median : 10413.54  
#>  10J1008:  1   Mean   : 95.85   Mean   :10.53   Mean   : 18924.99  
#>  10J1009:  1   3rd Qu.:102.00   3rd Qu.:14.00   3rd Qu.: 23253.25  
#>  10J1013:  1   Max.   :431.00   Max.   :96.00   Max.   :190448.34  
#>  (Other):591
hist(dsRFM$recency)

hist(dsRFM$frequency)

hist(dsRFM$monetary)

4.1 Remove outliers

The outliers of the example dataset were removed during sampling, so the procedure is not detailed here. Keep in mind that outliers need to be handled before applying k-means clustering, since its squared-error criterion is sensitive to them.
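
For illustration, a minimal sketch of one common approach (IQR-based filtering) on dsRFM; the 1.5 × IQR fences and the choice to drop rather than cap observations are assumptions, not part of the original analysis:

require(dplyr)
# TRUE for values inside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_keep <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}
dsNoOutliers <- dsRFM %>%
  filter(iqr_keep(recency), iqr_keep(frequency), iqr_keep(monetary))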

4.2 Log transformation

Monetary value is strongly right-skewed (its mean is well above its median), so a log10 transformation brings it closer to symmetry before scaling:

hist(log10(dsRFM$monetary))

dsData = dsRFM
dsData$monetary = log10(dsData$monetary) # work on a copy; dsRFM keeps the original scale

4.3. Normalize

# Keep customer ids as row names, then standardize the three RFM columns
row.names(dsData) <- dsRFM$CustomerId
#> Warning: Setting row names on a tibble is deprecated.
dsData <- scale(dsData[,2:4]) # z-scores: mean 0, standard deviation 1 per column
summary(dsData)
#>     recency           frequency          monetary      
#>  Min.   :-0.73177   Min.   :-0.7924   Min.   :-3.3773  
#>  1st Qu.:-0.65031   1st Qu.:-0.7093   1st Qu.:-0.5978  
#>  Median :-0.40593   Median :-0.3768   Median : 0.1548  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.05569   3rd Qu.: 0.2882   3rd Qu.: 0.6958  
#>  Max.   : 3.03355   Max.   : 7.1048   Max.   : 2.1122
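
scale() stores the column means and standard deviations it used as attributes, which is handy for translating cluster centers back to the original units. A quick check (not in the original):

attr(dsData, "scaled:center") # per-column means used for centering
attr(dsData, "scaled:scale")  # per-column standard deviations used for scaling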

5. K-means segmentation

K-means [1]

Default: kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)

# 6 clusters, up to 500 iterations, 10 random starts (the best solution is kept)
k2m <- kmeans(dsData, centers = 6, iter.max = 500, nstart = 10) 
k2m
#> K-means clustering with 6 clusters of sizes 158, 49, 98, 29, 195, 68
#> 
#> Cluster means:
#>      recency  frequency    monetary
#> 1 -0.5461060  0.6791523  0.89340636
#> 2  2.4845607 -0.7127040 -1.40111596
#> 3 -0.2043052 -0.6999802 -1.33865295
#> 4 -0.5903844  3.2722773  1.38563047
#> 5 -0.4576344 -0.3627274  0.11848354
#> 6  1.3371037 -0.4110249 -0.06769276
#> 
#> Clustering vector:
#>   [1] 1 5 4 5 3 5 1 5 1 2 3 3 6 5 2 5 5 3 2 5 5 5 6 2 2 5 1 5 5 1 1 5 3 5 5
#>  [36] 1 1 5 5 1 6 1 4 4 1 6 1 5 5 1 1 1 4 1 1 1 2 1 6 5 6 1 1 1 6 6 1 6 5 6
#>  [71] 1 1 5 3 5 2 5 5 1 1 5 1 3 5 1 2 1 5 6 5 1 5 5 5 1 1 2 5 5 5 2 1 1 1 5
#> [106] 3 3 4 5 5 6 2 5 1 5 3 5 5 5 5 1 5 5 1 5 5 1 1 5 2 1 3 5 1 2 4 2 6 1 6
#> [141] 4 1 2 1 6 1 6 5 1 5 5 5 2 1 3 5 6 2 1 5 4 1 1 1 1 6 6 5 5 1 5 1 2 2 5
#> [176] 5 6 5 1 6 6 6 1 5 6 4 5 1 6 1 3 3 3 1 3 5 1 1 2 6 1 5 6 1 4 1 1 6 4 5
#> [211] 1 1 4 4 5 5 1 6 2 5 2 5 5 5 2 5 3 5 5 5 6 6 5 5 1 5 1 1 1 1 1 5 1 1 3
#> [246] 1 5 1 1 5 1 1 2 1 6 3 3 5 5 1 1 4 6 2 5 1 1 1 1 5 1 1 2 6 6 5 1 3 4 4
#> [281] 6 1 5 1 1 6 4 4 6 6 5 6 1 1 1 6 2 5 1 1 1 1 6 3 3 5 6 6 1 5 5 1 1 1 2
#> [316] 5 5 5 3 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 4 3 5 1 5 4
#> [351] 5 3 5 1 1 3 5 6 5 5 3 5 5 6 3 1 3 1 3 3 1 5 3 3 5 1 3 3 3 1 2 3 3 1 5
#> [386] 3 4 3 3 5 5 3 3 3 2 3 5 6 3 5 1 5 5 5 6 5 3 1 2 3 5 2 3 3 3 3 1 6 3 2
#> [421] 5 3 5 5 1 6 4 5 6 5 1 5 1 1 1 1 3 5 5 3 2 5 2 5 1 4 1 4 1 5 5 1 5 2 5
#> [456] 1 1 4 5 1 1 1 5 3 3 5 6 3 1 6 5 5 3 3 3 6 5 3 3 3 5 5 6 6 3 5 2 6 1 5
#> [491] 2 2 6 3 5 1 5 5 4 5 6 5 5 5 5 3 3 5 2 4 1 3 1 5 1 5 1 5 5 2 5 5 6 5 5
#> [526] 5 2 1 5 5 2 5 5 6 2 5 1 5 4 5 5 5 5 1 5 1 4 3 1 2 2 1 1 5 5 1 5 3 5 1
#> [561] 1 3 5 5 5 1 3 5 5 5 1 2 2 5 5 5 5 1 5 6 5 1 5 6 3 6 1 6 5 6 5 5 5 6 5
#> [596] 5 1
#> 
#> Within cluster sum of squares by cluster:
#> [1] 76.31263 45.74772 53.34269 32.29354 60.27798 43.71220
#>  (between_SS / total_SS =  82.6 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"    
#> [5] "tot.withinss" "betweenss"    "size"         "iter"        
#> [9] "ifault"

Heatmap

par(mfrow = c(1, 2))
# Columns of each image are observations: original order on the left,
# grouped by cluster assignment on the right
image(t(dsData)[, nrow(dsData):1], yaxt = "n", main = "Original Data")
image(t(dsData)[, order(k2m$cluster)], yaxt = "n", main = "Clustered Data")

Package factoextra

fviz_cluster() provides a ggplot2-based visualization of partitioning results, including those of kmeans [stats package]. Observations are represented by points; when ncol(data) > 2 they are plotted on the first two principal components. An ellipse is drawn around each cluster.

require("factoextra")
fviz_cluster(k2m, data = dsData)

fviz_cluster(k2m, data = dsRFM[,2:4])

5.2 3D visualization (rgl)

colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')

options(rgl.printRglwidget = TRUE)
scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency", # raw (untransformed) RFM values are plotted
          surface.col = colors, axis.scales = FALSE,
          surface = TRUE, # fit a regression surface through each cluster
          fit = "smooth",
          # ellipsoid = TRUE, # to draw ellipsoids instead, enable this and set surface = FALSE
          grid = TRUE, axis.col = c("black", "black", "black"))

scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency",
          surface.col = colors, axis.scales = FALSE,
          surface = FALSE, # no regression surfaces this time
          fit = "smooth",
          ellipsoid = TRUE, # draw a concentration ellipsoid around each cluster
          grid = TRUE, axis.col = c("black", "black", "black"))

remove(colors)

6. SOM (kohonen segmentation) [2]

set.seed(7)

require(kohonen)
# Fit a 2 x 3 hexagonal self-organizing map (6 units, matching the 6 k-means clusters)
som.rfm <- som(dsData, grid = somgrid(2, 3, "hexagonal"))
plot(som.rfm, main = "RFM data")
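
plot.kohonen supports several other diagnostic views; two that are often useful (a sketch using standard plot types from the kohonen package):

plot(som.rfm, type = "counts", main = "Customers per unit") # how many observations map to each unit
plot(som.rfm, type = "mapping", main = "Mapping")           # where individual observations land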

7. Number of Clusters

Using NbClust indices to determine the number of clusters:

require(NbClust)
set.seed(1)
nc <- NbClust(dsData, min.nc=2, max.nc=7, method="kmeans")

#> *** : The Hubert index is a graphical method of determining the number of clusters.
#>                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
#>                 significant increase of the value of the measure i.e the significant peak in Hubert
#>                 index second differences plot. 
#> 

#> *** : The D index is a graphical method of determining the number of clusters. 
#>                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
#>                 second differences plot) that corresponds to a significant increase of the value of
#>                 the measure. 
#>  
#> ******************************************************************* 
#> * Among all indices:                                                
#> * 4 proposed 2 as the best number of clusters 
#> * 7 proposed 3 as the best number of clusters 
#> * 7 proposed 4 as the best number of clusters 
#> * 1 proposed 6 as the best number of clusters 
#> * 4 proposed 7 as the best number of clusters 
#> 
#>                    ***** Conclusion *****                            
#>  
#> * According to the majority rule, the best number of clusters is  3 
#>  
#>  
#> *******************************************************************

nc$All.index # estimates for each number of clusters on 26 different metrics of model fit
#>       KL       CH Hartigan     CCC     Scott   Marriot     TrCovW
#> 2 0.3566 397.6290 314.6535 -6.4658  603.9131 145921203 120278.448
#> 3 0.9404 460.5050 280.6405 -2.1031 1447.6081  79899737  54554.612
#> 4 2.9277 544.6780 149.7869  6.5104 2041.7195  52508509  27391.144
#> 5 1.0739 548.2150 129.9202  9.1947 2393.2709  45531252  17216.374
#> 6 1.6119 559.8577  98.1217 11.7591 2770.4403  34857472  10365.859
#> 7 2.6191 559.4129  68.0330 13.2141 3018.6191  31307542   7973.819
#>      TraceW Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2
#> 2 1071.7599   2.5359 1.6683 0.1962 1.2232     0.3692 0.8669  44.0672
#> 3  701.0330   6.5105 2.5505 0.1680 1.0185     0.3904 0.7405  54.6572
#> 4  476.0969   9.5004 3.7555 0.1681 0.8777     0.4110 1.3445 -62.0127
#> 5  380.0895  12.0401 4.7042 0.1799 0.9113     0.3750 1.5664 -91.8474
#> 6  311.6868  15.8727 5.7365 0.1618 0.9032     0.3798 1.7929 -43.3410
#> 7  267.3067  18.3930 6.6889 0.1551 0.9487     0.3508 1.6079 -25.7098
#>     Beale Ratkowsky     Ball Ptbiserial    Frey McClain   Dunn Hubert
#> 2  0.2606    0.4433 535.8800     0.4041 -0.0789  0.5998 0.0075 0.0007
#> 3  0.5933    0.4472 233.6777     0.5280  0.0205  0.7320 0.0125 0.0011
#> 4 -0.4293    0.4279 119.0242     0.5929  1.4921  0.7691 0.0164 0.0013
#> 5 -0.6117    0.3967  76.0179     0.5313  0.3685  1.0891 0.0180 0.0015
#> 6 -0.7260    0.3708  51.9478     0.5278  0.8379  1.1747 0.0102 0.0015
#> 7 -0.6189    0.3485  38.1867     0.4923  0.9313  1.4190 0.0181 0.0016
#>   SDindex Dindex   SDbw
#> 2  2.8685 1.1466 1.5492
#> 3  2.3771 0.9293 0.9725
#> 4  2.2695 0.7884 0.6047
#> 5  2.5067 0.6997 0.5928
#> 6  2.5389 0.6427 0.5778
#> 7  3.0266 0.5902 0.4688

# Note: the component is named Best.nc; its first row holds each index's chosen k
barplot(table(nc$Best.nc[1,]), 
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by Criteria")

set.seed(123)
# Elbow method: total within-cluster sum of squares as a function of k
fviz_nbclust(dsData, kmeans, method = "wss")

# Alternative: average silhouette width
fviz_nbclust(dsData, kmeans, method = "silhouette")
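
fviz_nbclust also supports the gap statistic; a sketch (nboot sets the number of bootstrap samples and drives the runtime):

set.seed(123)
fviz_nbclust(dsData, kmeans, method = "gap_stat", nboot = 50)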

8. Insights and graphs

# Sanity check: one cluster label per customer in both solutions
length(k2m$cluster)
#> [1] 597
length(som.rfm$unit.classif)
#> [1] 597
nrow(dsRFM)
#> [1] 597

dsRFM$clustersom = as.factor(som.rfm$unit.classif) # SOM unit each customer maps to
dsRFM$clusterk = as.factor(k2m$cluster)            # k-means cluster assignment
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 | 1 | 1 |
| 10J1004 | 24 | 7 | 18362.64 | 3 | 5 |
| 10J1007 | 16 | 44 | 59383.90 | 2 | 4 |
| 10J1008 | 86 | 3 | 9500.52 | 3 | 5 |
| 10J1009 | 158 | 1 | 1437.00 | 5 | 3 |
| 10J1013 | 66 | 2 | 6268.40 | 3 | 5 |

The melt function takes data in wide format and stacks a set of columns into a single column of data.

require(reshape2)
#> Loading required package: reshape2
#> 
#> Attaching package: 'reshape2'
#> The following object is masked from 'package:tidyr':
#> 
#>     smiths
dsClusters = melt(data.frame(dsRFM), id=c('CustomerId','clustersom','clusterk'))
kable(head(dsClusters))
| CustomerId | clustersom | clusterk | variable | value |
|---|---|---|---|---|
| 10J1001 | 1 | 1 | recency | 16 |
| 10J1004 | 3 | 5 | recency | 24 |
| 10J1007 | 2 | 4 | recency | 16 |
| 10J1008 | 3 | 5 | recency | 86 |
| 10J1009 | 5 | 3 | recency | 158 |
| 10J1013 | 3 | 5 | recency | 66 |
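
For reference, the same reshape can be done with tidyr, which is already loaded (a sketch; gather() was the current tidyr spelling at the time of writing, pivot_longer() is the newer one):

dsClusters2 <- dsRFM %>%
  data.frame() %>%
  gather(variable, value, recency, frequency, monetary)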
Merge

dsMember = read.csv2("D://Datasets//dsBonusRFM.csv") 
dsMember = inner_join(dsMember, dsRFM) # joins on the shared CustomerId column
#> Joining, by = "CustomerId"
#> Warning: Column `CustomerId` joining factors with different levels,
#> coercing to character vector
kable(head(dsMember))
| NoOfEmployeeGroup | IndustryType | CustomerId | MemberAmount | BonusPotential | PercBonusPotentialSaved | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 to 9 | 1 | 26F2634 | 676.40 | 9910.16 | 6.825319 | 17 | 4 | 2698.92 | 5 | 3 |
| 01 to 9 | 1 | 38M385 | 947.50 | 9910.16 | 9.560895 | 24 | 33 | 26229.88 | 1 | 1 |
| 01 to 9 | 1 | 27F2718 | 1104.75 | 9910.16 | 11.147650 | 66 | 7 | 5026.32 | 3 | 5 |
| 01 to 9 | 1 | 13J1305 | 1317.25 | 9910.16 | 13.291915 | 35 | 24 | 27794.34 | 1 | 1 |
| 01 to 9 | 1 | 27F2709 | 2035.45 | 9910.16 | 20.539023 | 24 | 21 | 35166.64 | 1 | 1 |
| 01 to 9 | 1 | 35M353 | 2666.49 | 9910.16 | 26.906629 | 16 | 26 | 33574.52 | 1 | 1 |

Cluster - bar plot

ggplot(dsClusters, aes(factor(clustersom), value, fill=variable)) + geom_bar(stat='identity', position='dodge') +
  xlab("Cluster") + ylab('Values') 

Density

p1= ggplot(dsMember, aes(x=recency, colour=clusterk, fill=clusterk, group = clusterk))  + 
  geom_density(alpha=.5) + theme(legend.position="left")
p2= ggplot(dsMember, aes(x=monetary, colour=clusterk, fill=clusterk, group = clusterk))  + 
  geom_density(alpha=.5)  + theme(legend.position="none")
p3= ggplot(dsMember, aes(x=frequency, colour=clusterk, fill=clusterk, group = clusterk))  + 
  geom_density(alpha=.5) + theme(legend.position="none")
p4= ggplot(dsMember, aes(x=recency, colour=clustersom, fill=clustersom))  + geom_density(alpha=.5) + theme(legend.position="left") 
p5= ggplot(dsMember, aes(x=monetary, colour=clustersom, fill=clustersom))  + geom_density(alpha=.5) + theme(legend.position="none")
p6= ggplot(dsMember, aes(x=frequency, colour=clustersom, fill=clustersom))  + geom_density(alpha=.5) + theme(legend.position="none")
require(ggpubr)
#> Loading required package: ggpubr
#> Loading required package: magrittr
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
ggarrange(p1,p4,p2,p5,p3,p6, ncol = 2, nrow = 3)

Potential Savings

ggplot(dsMember, aes(x=factor(clustersom), y=PercBonusPotentialSaved)) + geom_bar(stat = "summary", fun.y = "mean") +
  xlab("Cluster") + ylab('Percentage saved in  \nOffice Supplies') + ggtitle("Bonus Potential Savings")

9. Hierarchical Clustering

Since K-means partitioning is an NP-hard problem, an approximate solution is to use multiple random starts of the algorithm and retain the solution that minimizes the total error sum of squares criterion. A more direct and computationally efficient approach is to first apply Ward's minimum variance agglomerative clustering to the data, identify the partition of the objects into K groups in the dendrogram, and then use that partition as the starting approximation for K-means partitioning [3]; a sketch of this hybrid start follows the hclust example below. Ward's method is an agglomerative clustering method based on a classical sum-of-squares criterion, producing groups that minimize within-group dispersion.

hclust performs hierarchical cluster analysis on a set of dissimilarities. The example below uses Euclidean distances (dist) with the ward.D2 method.

d <- dist(dsData)                  # Euclidean distance matrix
c <- hclust(d, method = 'ward.D2') # ward.D2 squares the dissimilarities, implementing Ward's criterion
plot(c)
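
To see where a cut into 8 groups falls on the dendrogram, rect.hclust (base stats) can draw the boundaries; a quick sketch:

plot(c)
rect.hclust(c, k = 8, border = "red") # outline the 8 groups on the tree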


# Cut the tree into 8 groups
members <- cutree(c, k = 8)
members[1:5]
#> [1] 1 1 2 3 4
table(members)
#> members
#>   1   2   3   4   5   6   7   8 
#> 177  24 157  75  67   9  48  40
aggregate(dsRFM[,2:4], by=list(members), mean) # mean RFM profile per group
#>   Group.1   recency frequency  monetary
#> 1       1  33.98305 12.564972 29146.385
#> 2       2  26.08333 52.541667 75553.262
#> 3       3  49.42675  6.127389  7245.670
#> 4       4  81.08000  1.506667  1038.074
#> 5       5 338.05970  3.223881  4416.419
#> 6       6 398.88889  1.000000   205.660
#> 7       7 189.35417  6.583333 10909.573
#> 8       8  35.25000 29.675000 57229.812
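
The hybrid approach described at the top of this section can be sketched as follows, reusing the Ward tree with k = 6 to match the earlier k-means run (the object names are illustrative):

members6 <- cutree(c, k = 6) # Ward partition with the same k as the k-means run
# Group means of the Ward partition serve as starting centers for k-means
centers6 <- aggregate(as.data.frame(dsData), by = list(members6), FUN = mean)[, -1]
k2m.ward <- kmeans(dsData, centers = as.matrix(centers6), iter.max = 500)
k2m.ward$size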

10. References

  1. Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.

  2. Wehrens, R. and Buydens, L. M. C. (2007). Self- and super-organizing maps in R: the kohonen package. Journal of Statistical Software, 21(5); Wehrens, R. and Kruisselbrink, J., submitted, 2017.

  3. Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274-295.