Author: Joselene Marques
Affiliation: Justt
# Data manipulation
library("dplyr")
library("knitr")
library("tidyr")
# Cluster
library("kohonen")
library("NbClust")
library("factoextra")
# Visualization
library("ggplot2")
library("car") # associated with the book 'An R Companion to Applied Regression', Third Edition, by John Fox and Sanford Weisberg.
library("rgl")
The table below presents the most commonly used functions in dplyr.
| Function | Returns |
|---|---|
| filter(data, conditions) | rows from data where conditions hold |
| select(data, variables) | subset of the columns in data, as specified in variables |
| arrange(data, variables) | data sorted by variables |
| group_by(data, variables) | copy of data, with groups defined by variables |
| summarize(data, newvar = function) | data frame with newvar columns that summarize data (or each group in data) based on an aggregation function |
| mutate(data, newvar = function) | data frame with newvar columns defined by a function of existing columns |
| inner_join(data1, data2, by = variables) | data frame that joins columns from data1 and data2 based on matching values of variables (left_join, right_join, and full_join behave analogously) |
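As a minimal illustration of how these verbs chain together with the pipe (using a throwaway data frame, not the invoice data):
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))
toy %>%
  filter(x > 1) %>%          # keep rows where x exceeds 1
  group_by(g) %>%            # one group per value of g
  summarise(total = sum(x))  # aggregate x within each group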
dsInvoice = read.csv2("D://Datasets//dsRFM.csv")
str(dsInvoice)
#> 'data.frame': 6850 obs. of 3 variables:
#> $ InvoiceDate: Factor w/ 287 levels "2017-01-02","2017-01-03",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ CustomerId : Factor w/ 612 levels "10J1001","10J1004",..: 560 411 354 558 448 448 110 474 346 109 ...
#> $ Amount : num 987 9333 251 8875 208 ...
require(dplyr)
dsInvoice <- dsInvoice %>% mutate(Amount = replace(Amount, Amount<=0, NA))
require(tidyr)
dsInvoice <- dsInvoice %>% drop_na()
dsRFM <- dsInvoice %>% group_by(CustomerId) %>%
summarise(recency=as.numeric(Sys.Date()-max(as.Date(InvoiceDate))),
frequency=n(), monetary = sum(Amount))
require(knitr)
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary |
|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 |
| 10J1004 | 24 | 7 | 18362.64 |
| 10J1007 | 16 | 44 | 59383.90 |
| 10J1008 | 86 | 3 | 9500.52 |
| 10J1009 | 158 | 1 | 1437.00 |
| 10J1013 | 66 | 2 | 6268.40 |
summary(dsRFM)
#> CustomerId recency frequency monetary
#> 10J1001: 1 Min. : 15.00 Min. : 1.00 Min. : 54.96
#> 10J1004: 1 1st Qu.: 24.00 1st Qu.: 2.00 1st Qu.: 3406.83
#> 10J1007: 1 Median : 51.00 Median : 6.00 Median : 10413.54
#> 10J1008: 1 Mean : 95.85 Mean :10.53 Mean : 18924.99
#> 10J1009: 1 3rd Qu.:102.00 3rd Qu.:14.00 3rd Qu.: 23253.25
#> 10J1013: 1 Max. :431.00 Max. :96.00 Max. :190448.34
#> (Other):591
hist(dsRFM$recency)
hist(dsRFM$frequency)
hist(dsRFM$monetary)
The outliers in the example dataset were removed during sampling, so outlier treatment is not detailed here. Keep in mind that outliers need to be handled before applying k-means clustering, since the algorithm's squared-error criterion is sensitive to extreme values.
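For reference, one common approach (a sketch only, not the procedure used to build this sample) is to drop observations outside 1.5 times the interquartile range:
q <- quantile(dsRFM$monetary, c(0.25, 0.75))  # illustrative IQR rule on monetary
iqr <- q[2] - q[1]
dsNoOutliers <- dsRFM %>%
  filter(monetary >= q[1] - 1.5 * iqr, monetary <= q[2] + 1.5 * iqr)
The monetary variable is strongly right-skewed (its mean sits well above its median), so a log10 transform makes the distribution more symmetric: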
hist(log10(dsRFM$monetary))
dsData = dsRFM
dsData$monetary = log10(dsData$monetary)
row.names(dsData) <- dsRFM$CustomerId
#> Warning: Setting row names on a tibble is deprecated.
dsData <- scale(dsData[,2:4])
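The warning above appears because summarise() returns a tibble, and tibbles deprecate row names. A minimal workaround, if the customer ids should survive as row labels, is to coerce to a base data.frame first:
dsData <- as.data.frame(dsRFM)  # plain data.frame, so row names are allowed
dsData$monetary <- log10(dsData$monetary)
row.names(dsData) <- dsData$CustomerId
dsData <- scale(dsData[, 2:4])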
summary(dsData)
#> recency frequency monetary
#> Min. :-0.73177 Min. :-0.7924 Min. :-3.3773
#> 1st Qu.:-0.65031 1st Qu.:-0.7093 1st Qu.:-0.5978
#> Median :-0.40593 Median :-0.3768 Median : 0.1548
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.05569 3rd Qu.: 0.2882 3rd Qu.: 0.6958
#> Max. : 3.03355 Max. : 7.1048 Max. : 2.1122
Default signature: kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
k2m <- kmeans(dsData, centers = 6, iter.max = 500, nstart = 10)
k2m
#> K-means clustering with 6 clusters of sizes 158, 49, 98, 29, 195, 68
#>
#> Cluster means:
#> recency frequency monetary
#> 1 -0.5461060 0.6791523 0.89340636
#> 2 2.4845607 -0.7127040 -1.40111596
#> 3 -0.2043052 -0.6999802 -1.33865295
#> 4 -0.5903844 3.2722773 1.38563047
#> 5 -0.4576344 -0.3627274 0.11848354
#> 6 1.3371037 -0.4110249 -0.06769276
#>
#> Clustering vector:
#> [1] 1 5 4 5 3 5 1 5 1 2 3 3 6 5 2 5 5 3 2 5 5 5 6 2 2 5 1 5 5 1 1 5 3 5 5
#> [36] 1 1 5 5 1 6 1 4 4 1 6 1 5 5 1 1 1 4 1 1 1 2 1 6 5 6 1 1 1 6 6 1 6 5 6
#> [71] 1 1 5 3 5 2 5 5 1 1 5 1 3 5 1 2 1 5 6 5 1 5 5 5 1 1 2 5 5 5 2 1 1 1 5
#> [106] 3 3 4 5 5 6 2 5 1 5 3 5 5 5 5 1 5 5 1 5 5 1 1 5 2 1 3 5 1 2 4 2 6 1 6
#> [141] 4 1 2 1 6 1 6 5 1 5 5 5 2 1 3 5 6 2 1 5 4 1 1 1 1 6 6 5 5 1 5 1 2 2 5
#> [176] 5 6 5 1 6 6 6 1 5 6 4 5 1 6 1 3 3 3 1 3 5 1 1 2 6 1 5 6 1 4 1 1 6 4 5
#> [211] 1 1 4 4 5 5 1 6 2 5 2 5 5 5 2 5 3 5 5 5 6 6 5 5 1 5 1 1 1 1 1 5 1 1 3
#> [246] 1 5 1 1 5 1 1 2 1 6 3 3 5 5 1 1 4 6 2 5 1 1 1 1 5 1 1 2 6 6 5 1 3 4 4
#> [281] 6 1 5 1 1 6 4 4 6 6 5 6 1 1 1 6 2 5 1 1 1 1 6 3 3 5 6 6 1 5 5 1 1 1 2
#> [316] 5 5 5 3 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 4 3 5 1 5 4
#> [351] 5 3 5 1 1 3 5 6 5 5 3 5 5 6 3 1 3 1 3 3 1 5 3 3 5 1 3 3 3 1 2 3 3 1 5
#> [386] 3 4 3 3 5 5 3 3 3 2 3 5 6 3 5 1 5 5 5 6 5 3 1 2 3 5 2 3 3 3 3 1 6 3 2
#> [421] 5 3 5 5 1 6 4 5 6 5 1 5 1 1 1 1 3 5 5 3 2 5 2 5 1 4 1 4 1 5 5 1 5 2 5
#> [456] 1 1 4 5 1 1 1 5 3 3 5 6 3 1 6 5 5 3 3 3 6 5 3 3 3 5 5 6 6 3 5 2 6 1 5
#> [491] 2 2 6 3 5 1 5 5 4 5 6 5 5 5 5 3 3 5 2 4 1 3 1 5 1 5 1 5 5 2 5 5 6 5 5
#> [526] 5 2 1 5 5 2 5 5 6 2 5 1 5 4 5 5 5 5 1 5 1 4 3 1 2 2 1 1 5 5 1 5 3 5 1
#> [561] 1 3 5 5 5 1 3 5 5 5 1 2 2 5 5 5 5 1 5 6 5 1 5 6 3 6 1 6 5 6 5 5 5 6 5
#> [596] 5 1
#>
#> Within cluster sum of squares by cluster:
#> [1] 76.31263 45.74772 53.34269 32.29354 60.27798 43.71220
#> (between_SS / total_SS = 82.6 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss"
#> [5] "tot.withinss" "betweenss" "size" "iter"
#> [9] "ifault"
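To read the centers in the original units, the attributes stored by scale() can be inverted (a sketch that is not part of the original analysis; remember that monetary was also log10-transformed):
centers <- sweep(k2m$centers, 2, attr(dsData, "scaled:scale"), "*")  # undo standardization
centers <- sweep(centers, 2, attr(dsData, "scaled:center"), "+")
centers[, "monetary"] <- 10^centers[, "monetary"]                    # undo the log10
round(centers, 1)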
par(mfrow = c(1, 2))
image(t(dsData)[, nrow(dsData):1], yaxt = "n", main = "Original Data")
image(t(dsData)[, order(k2m$cluster)], yaxt = "n", main = "Clustered Data")
fviz_cluster provides an elegant ggplot2-based visualization of partitioning results, including those of kmeans [stats package]. Observations are represented by points in the plot, using principal components when ncol(data) > 2, and an ellipse is drawn around each cluster.
require("factoextra")
fviz_cluster(k2m, data = dsData)       # clusters in the scaled feature space used for fitting
fviz_cluster(k2m, data = dsRFM[,2:4])  # the same labels projected on the raw RFM values
colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
options(rgl.printRglwidget = TRUE)
scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency",
          surface.col = colors, axis.scales = FALSE,
          surface = TRUE, # fits a regression surface through the points of each group
          fit = "smooth",
          # ellipsoid = TRUE, # to draw ellipsoids instead, enable this and set surface = FALSE
          grid = TRUE, axis.col = c("black", "black", "black"))
scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency",
          surface.col = colors, axis.scales = FALSE,
          surface = FALSE, # no regression surfaces this time
          fit = "smooth",
          ellipsoid = TRUE, # draw a concentration ellipsoid around each cluster
          grid = TRUE, axis.col = c("black", "black", "black"))
remove(colors)
set.seed(7)
require(kohonen)
som.rfm <- som(dsData, grid = somgrid(2, 3, "hexagonal"))
plot(som.rfm, main = "RFM data")
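Beyond the default codes plot, the fitted kohonen object also exposes the winning unit per observation and the codebook vectors, which is handy for quick diagnostics (a brief sketch):
plot(som.rfm, type = "counts")  # how many customers map to each unit
head(som.rfm$unit.classif)      # winning unit for the first observations
som.rfm$codes                   # codebook vectors (a list in kohonen >= 3.0)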
Using NbClust metrics to determine the number of clusters
require(NbClust)
set.seed(1)
nc <- NbClust(dsData, min.nc=2, max.nc=7, method="kmeans")
#> *** : The Hubert index is a graphical method of determining the number of clusters.
#> In the plot of Hubert index, we seek a significant knee that corresponds to a
#> significant increase of the value of the measure i.e the significant peak in Hubert
#> index second differences plot.
#>
#> *** : The D index is a graphical method of determining the number of clusters.
#> In the plot of D index, we seek a significant knee (the significant peak in Dindex
#> second differences plot) that corresponds to a significant increase of the value of
#> the measure.
#>
#> *******************************************************************
#> * Among all indices:
#> * 4 proposed 2 as the best number of clusters
#> * 7 proposed 3 as the best number of clusters
#> * 7 proposed 4 as the best number of clusters
#> * 1 proposed 6 as the best number of clusters
#> * 4 proposed 7 as the best number of clusters
#>
#> ***** Conclusion *****
#>
#> * According to the majority rule, the best number of clusters is 3
#>
#>
#> *******************************************************************
nc$All.index # estimates for each number of clusters on 26 different metrics of model fit
#> KL CH Hartigan CCC Scott Marriot TrCovW
#> 2 0.3566 397.6290 314.6535 -6.4658 603.9131 145921203 120278.448
#> 3 0.9404 460.5050 280.6405 -2.1031 1447.6081 79899737 54554.612
#> 4 2.9277 544.6780 149.7869 6.5104 2041.7195 52508509 27391.144
#> 5 1.0739 548.2150 129.9202 9.1947 2393.2709 45531252 17216.374
#> 6 1.6119 559.8577 98.1217 11.7591 2770.4403 34857472 10365.859
#> 7 2.6191 559.4129 68.0330 13.2141 3018.6191 31307542 7973.819
#> TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
#> 2 1071.7599 2.5359 1.6683 0.1962 1.2232 0.3692 0.8669 44.0672
#> 3 701.0330 6.5105 2.5505 0.1680 1.0185 0.3904 0.7405 54.6572
#> 4 476.0969 9.5004 3.7555 0.1681 0.8777 0.4110 1.3445 -62.0127
#> 5 380.0895 12.0401 4.7042 0.1799 0.9113 0.3750 1.5664 -91.8474
#> 6 311.6868 15.8727 5.7365 0.1618 0.9032 0.3798 1.7929 -43.3410
#> 7 267.3067 18.3930 6.6889 0.1551 0.9487 0.3508 1.6079 -25.7098
#> Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert
#> 2 0.2606 0.4433 535.8800 0.4041 -0.0789 0.5998 0.0075 0.0007
#> 3 0.5933 0.4472 233.6777 0.5280 0.0205 0.7320 0.0125 0.0011
#> 4 -0.4293 0.4279 119.0242 0.5929 1.4921 0.7691 0.0164 0.0013
#> 5 -0.6117 0.3967 76.0179 0.5313 0.3685 1.0891 0.0180 0.0015
#> 6 -0.7260 0.3708 51.9478 0.5278 0.8379 1.1747 0.0102 0.0015
#> 7 -0.6189 0.3485 38.1867 0.4923 0.9313 1.4190 0.0181 0.0016
#> SDindex Dindex SDbw
#> 2 2.8685 1.1466 1.5492
#> 3 2.3771 0.9293 0.9725
#> 4 2.2695 0.7884 0.6047
#> 5 2.5067 0.6997 0.5928
#> 6 2.5389 0.6427 0.5778
#> 7 3.0266 0.5902 0.4688
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by Criteria")
set.seed(123)
fviz_nbclust(dsData, kmeans, method = "wss")  # elbow (total within sum of squares)
# Alternative: average silhouette width
fviz_nbclust(dsData, kmeans, method = "silhouette")
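For intuition, the elbow curve that fviz_nbclust draws is simply the total within-cluster sum of squares across candidate values of k; a hand-rolled sketch of the same quantity:
wss <- sapply(1:7, function(k) kmeans(dsData, centers = k, nstart = 10)$tot.withinss)
plot(1:7, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")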
length(k2m$cluster)
#> [1] 597
length(som.rfm$unit.classif)
#> [1] 597
nrow(dsRFM)
#> [1] 597
dsRFM$clustersom = as.factor(som.rfm$unit.classif)
dsRFM$clusterk=as.factor(k2m$cluster)
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 | 1 | 1 |
| 10J1004 | 24 | 7 | 18362.64 | 3 | 5 |
| 10J1007 | 16 | 44 | 59383.90 | 2 | 4 |
| 10J1008 | 86 | 3 | 9500.52 | 3 | 5 |
| 10J1009 | 158 | 1 | 1437.00 | 5 | 3 |
| 10J1013 | 66 | 2 | 6268.40 | 3 | 5 |
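With both labels attached, a simple cross-tabulation (not part of the original write-up) shows how far the SOM and k-means partitions agree:
table(dsRFM$clustersom, dsRFM$clusterk)  # rows: SOM units; columns: k-means clusters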
The melt function takes data in wide format and stacks a set of columns into a single column of data.
require(reshape2)
#> Loading required package: reshape2
#>
#> Attaching package: 'reshape2'
#> The following object is masked from 'package:tidyr':
#>
#> smiths
dsClusters = melt(data.frame(dsRFM), id=c('CustomerId','clustersom','clusterk'))
kable(head(dsClusters))
| CustomerId | clustersom | clusterk | variable | value |
|---|---|---|---|---|
| 10J1001 | 1 | 1 | recency | 16 |
| 10J1004 | 3 | 5 | recency | 24 |
| 10J1007 | 2 | 4 | recency | 16 |
| 10J1008 | 3 | 5 | recency | 86 |
| 10J1009 | 5 | 3 | recency | 158 |
| 10J1013 | 3 | 5 | recency | 66 |
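Since tidyr is already loaded, the same reshaping can be written with pivot_longer (assuming tidyr >= 1.0), which avoids the extra reshape2 dependency:
dsClusters2 <- dsRFM %>%  # equivalent of the melt() call above
  pivot_longer(cols = c(recency, frequency, monetary),
               names_to = "variable", values_to = "value")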
### Merge
dsMember = read.csv2("D://Datasets//dsBonusRFM.csv")
dsMember = inner_join(dsMember, dsRFM)
#> Joining, by = "CustomerId"
#> Warning: Column `CustomerId` joining factors with different levels,
#> coercing to character vector
kable(head(dsMember))
| NoOfEmployeeGroup | IndustryType | CustomerId | MemberAmount | BonusPotential | PercBonusPotentialSaved | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 to 9 | 1 | 26F2634 | 676.40 | 9910.16 | 6.825319 | 17 | 4 | 2698.92 | 5 | 3 |
| 01 to 9 | 1 | 38M385 | 947.50 | 9910.16 | 9.560895 | 24 | 33 | 26229.88 | 1 | 1 |
| 01 to 9 | 1 | 27F2718 | 1104.75 | 9910.16 | 11.147650 | 66 | 7 | 5026.32 | 3 | 5 |
| 01 to 9 | 1 | 13J1305 | 1317.25 | 9910.16 | 13.291915 | 35 | 24 | 27794.34 | 1 | 1 |
| 01 to 9 | 1 | 27F2709 | 2035.45 | 9910.16 | 20.539023 | 24 | 21 | 35166.64 | 1 | 1 |
| 01 to 9 | 1 | 35M353 | 2666.49 | 9910.16 | 26.906629 | 16 | 26 | 33574.52 | 1 | 1 |
ggplot(dsClusters, aes(factor(clustersom), value, fill=variable)) + geom_bar(stat='identity', position='dodge') +
xlab("Cluster") + ylab('Values')
p1= ggplot(dsMember, aes(x=recency, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="left")
p2= ggplot(dsMember, aes(x=monetary, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="none")
p3= ggplot(dsMember, aes(x=frequency, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="none")
p4= ggplot(dsMember, aes(x=recency, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="left")
p5= ggplot(dsMember, aes(x=monetary, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="none")
p6= ggplot(dsMember, aes(x=frequency, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="none")
require(ggpubr)
#> Loading required package: ggpubr
#> Loading required package: magrittr
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:tidyr':
#>
#> extract
ggarrange(p1,p4,p2,p5,p3,p6, ncol = 2, nrow = 3)
ggplot(dsMember, aes(x=factor(clustersom), y=PercBonusPotentialSaved)) + geom_bar(stat = "summary", fun.y = "mean")+
xlab("Cluster") + ylab('Percentage saved in \nOffice Supplies') + ggtitle("Bonus Pontential Savings")
Since K-means partitioning is an NP-hard problem, an approximate solution is to use multiple random starts of the algorithm and retain the solution that minimizes the total error sum of squares criterion. A more direct and computationally efficient approach is to first apply Ward's minimum variance agglomerative clustering to the data, identify the partition of the objects into K groups in the dendrogram, and then use that partition as the starting approximation for K-means partitioning³. Ward's method is an agglomerative clustering method based on a classical sum-of-squares criterion, producing groups that minimize within-group dispersion.
hclust performs hierarchical cluster analysis on a set of dissimilarities. The example below uses Euclidean distances (computed with dist) and the ward.D2 method.
d <- dist(dsData)
c <- hclust(d, method = 'ward.D2')
plot(c)
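Drawing rectangles on the dendrogram makes the chosen cut easier to see (rect.hclust is part of base stats and must be called while the dendrogram plot is active):
rect.hclust(c, k = 8, border = "red")  # outline the 8 groups cut below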
# Cut
members <- cutree(c,k = 8)
members[1:5]
#> [1] 1 1 2 3 4
table(members)
#> members
#> 1 2 3 4 5 6 7 8
#> 177 24 157 75 67 9 48 40
aggregate(dsRFM[,2:4], by=list(members), mean)
#> Group.1 recency frequency monetary
#> 1 1 33.98305 12.564972 29146.385
#> 2 2 26.08333 52.541667 75553.262
#> 3 3 49.42675 6.127389 7245.670
#> 4 4 81.08000 1.506667 1038.074
#> 5 5 338.05970 3.223881 4416.419
#> 6 6 398.88889 1.000000 205.660
#> 7 7 189.35417 6.583333 10909.573
#> 8 8 35.25000 29.675000 57229.812
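To close the loop on the Ward-then-k-means idea described above, the group centroids of the hierarchical partition can seed kmeans directly (a sketch; k = 6 is used here only to match the earlier run):
wardStart <- aggregate(dsData, by = list(cutree(c, k = 6)), FUN = mean)[, -1]  # Ward centroids
k2m.ward <- kmeans(dsData, centers = as.matrix(wardStart), iter.max = 500)     # seeded k-means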
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.↩
Wehrens, R. and Buydens, L. M. C. (2007). Self- and super-organizing maps in R: the kohonen package. Journal of Statistical Software, 21(5); Wehrens, R. and Kruisselbrink, J., submitted, 2017.↩
Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274–295.↩