Author: Joselene Marques
Affiliation: Justt
# Data manipulation
library("dplyr")
library("knitr")
library("tidyr")
# Cluster
library("kohonen")
library("NbClust")
library("factoextra")
# Visualization
library("ggplot2")
library("car") # associated with the book 'An R Companion to Applied Regression', Third Edition, by John Fox and Sanford Weisberg.
library("rgl")
The table below presents the most commonly used functions in dplyr.
| Function | Returns |
|---|---|
| filter(data, conditions) | rows from data where conditions hold |
| select(data, variables) | subset of the columns in data, as specified in variables |
| arrange(data, variables) | data sorted by variables |
| group_by(data, variables) | copy of data, with groups defined by variables |
| summarize(data, newvar = function) | data frame with newvar columns that summarize data (or each group in data) based on an aggregation function |
| mutate(data, newvar = function) | data frame with newvar columns defined by a function of existing columns |
| inner_join(data1, data2, by = variables) | data frame that joins columns from data1 and data2 based on matching values of variables (left_join, right_join, and full_join behave analogously) |
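As a minimal illustration of how these verbs chain together with the pipe (using a throwaway data frame, not the invoice data):
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))
toy %>%
  filter(x > 1) %>%          # keep rows where x exceeds 1
  group_by(g) %>%            # one group per value of g
  summarise(total = sum(x))  # aggregate x within each group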
dsInvoice = read.csv2("D://Datasets//dsRFM.csv")
str(dsInvoice)
#> 'data.frame': 6850 obs. of 3 variables:
#> $ InvoiceDate: Factor w/ 287 levels "2017-01-02","2017-01-03",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ CustomerId : Factor w/ 612 levels "10J1001","10J1004",..: 560 411 354 558 448 448 110 474 346 109 ...
#> $ Amount : num 987 9333 251 8875 208 ...
require(dplyr)
dsInvoice <- dsInvoice %>% mutate(Amount = replace(Amount, Amount<=0, NA))
require(tidyr)
dsInvoice <- dsInvoice %>% drop_na()
dsRFM <- dsInvoice %>% group_by(CustomerId) %>%
summarise(recency=as.numeric(Sys.Date()-max(as.Date(InvoiceDate))),
frequency=n(), monetary = sum(Amount))
require(knitr)
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary |
|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 |
| 10J1004 | 24 | 7 | 18362.64 |
| 10J1007 | 16 | 44 | 59383.90 |
| 10J1008 | 86 | 3 | 9500.52 |
| 10J1009 | 158 | 1 | 1437.00 |
| 10J1013 | 66 | 2 | 6268.40 |
summary(dsRFM)
#> CustomerId recency frequency monetary
#> 10J1001: 1 Min. : 15.00 Min. : 1.00 Min. : 54.96
#> 10J1004: 1 1st Qu.: 24.00 1st Qu.: 2.00 1st Qu.: 3406.83
#> 10J1007: 1 Median : 51.00 Median : 6.00 Median : 10413.54
#> 10J1008: 1 Mean : 95.85 Mean :10.53 Mean : 18924.99
#> 10J1009: 1 3rd Qu.:102.00 3rd Qu.:14.00 3rd Qu.: 23253.25
#> 10J1013: 1 Max. :431.00 Max. :96.00 Max. :190448.34
#> (Other):591
hist(dsRFM$recency)
hist(dsRFM$frequency)
hist(dsRFM$monetary)
The outliers in the example dataset were removed during sampling, so outlier treatment is not detailed here. Keep in mind that outliers need to be handled before applying k-means clustering, since the algorithm's squared-error criterion is sensitive to extreme values.
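For reference, one common approach (a sketch only, not the procedure used to build this sample) is to drop observations outside 1.5 times the interquartile range:
q <- quantile(dsRFM$monetary, c(0.25, 0.75))  # illustrative IQR rule on monetary
iqr <- q[2] - q[1]
dsNoOutliers <- dsRFM %>%
  filter(monetary >= q[1] - 1.5 * iqr, monetary <= q[2] + 1.5 * iqr)
The monetary variable is strongly right-skewed (its mean sits well above its median), so a log10 transform makes the distribution more symmetric: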
hist(log10(dsRFM$monetary))
dsData = dsRFM
dsData$monetary = log10(dsData$monetary)
row.names(dsData) <- dsRFM$CustomerId
#> Warning: Setting row names on a tibble is deprecated.
dsData <- scale(dsData[,2:4])
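The warning above appears because summarise() returns a tibble, and tibbles deprecate row names. A minimal workaround, if the customer ids should survive as row labels, is to coerce to a base data.frame first:
dsData <- as.data.frame(dsRFM)  # plain data.frame, so row names are allowed
dsData$monetary <- log10(dsData$monetary)
row.names(dsData) <- dsData$CustomerId
dsData <- scale(dsData[, 2:4])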
summary(dsData)
#> recency frequency monetary
#> Min. :-0.73177 Min. :-0.7924 Min. :-3.3773
#> 1st Qu.:-0.65031 1st Qu.:-0.7093 1st Qu.:-0.5978
#> Median :-0.40593 Median :-0.3768 Median : 0.1548
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.05569 3rd Qu.: 0.2882 3rd Qu.: 0.6958
#> Max. : 3.03355 Max. : 7.1048 Max. : 2.1122
Default signature: kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)
k2m <- kmeans(dsData, centers = 6, iter.max = 500, nstart = 10)
k2m
#> K-means clustering with 6 clusters of sizes 158, 49, 98, 29, 195, 68
#>
#> Cluster means:
#> recency frequency monetary
#> 1 -0.5461060 0.6791523 0.89340636
#> 2 2.4845607 -0.7127040 -1.40111596
#> 3 -0.2043052 -0.6999802 -1.33865295
#> 4 -0.5903844 3.2722773 1.38563047
#> 5 -0.4576344 -0.3627274 0.11848354
#> 6 1.3371037 -0.4110249 -0.06769276
#>
#> Clustering vector:
#> [1] 1 5 4 5 3 5 1 5 1 2 3 3 6 5 2 5 5 3 2 5 5 5 6 2 2 5 1 5 5 1 1 5 3 5 5
#> [36] 1 1 5 5 1 6 1 4 4 1 6 1 5 5 1 1 1 4 1 1 1 2 1 6 5 6 1 1 1 6 6 1 6 5 6
#> [71] 1 1 5 3 5 2 5 5 1 1 5 1 3 5 1 2 1 5 6 5 1 5 5 5 1 1 2 5 5 5 2 1 1 1 5
#> [106] 3 3 4 5 5 6 2 5 1 5 3 5 5 5 5 1 5 5 1 5 5 1 1 5 2 1 3 5 1 2 4 2 6 1 6
#> [141] 4 1 2 1 6 1 6 5 1 5 5 5 2 1 3 5 6 2 1 5 4 1 1 1 1 6 6 5 5 1 5 1 2 2 5
#> [176] 5 6 5 1 6 6 6 1 5 6 4 5 1 6 1 3 3 3 1 3 5 1 1 2 6 1 5 6 1 4 1 1 6 4 5
#> [211] 1 1 4 4 5 5 1 6 2 5 2 5 5 5 2 5 3 5 5 5 6 6 5 5 1 5 1 1 1 1 1 5 1 1 3
#> [246] 1 5 1 1 5 1 1 2 1 6 3 3 5 5 1 1 4 6 2 5 1 1 1 1 5 1 1 2 6 6 5 1 3 4 4
#> [281] 6 1 5 1 1 6 4 4 6 6 5 6 1 1 1 6 2 5 1 1 1 1 6 3 3 5 6 6 1 5 5 1 1 1 2
#> [316] 5 5 5 3 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 4 3 5 1 5 4
#> [351] 5 3 5 1 1 3 5 6 5 5 3 5 5 6 3 1 3 1 3 3 1 5 3 3 5 1 3 3 3 1 2 3 3 1 5
#> [386] 3 4 3 3 5 5 3 3 3 2 3 5 6 3 5 1 5 5 5 6 5 3 1 2 3 5 2 3 3 3 3 1 6 3 2
#> [421] 5 3 5 5 1 6 4 5 6 5 1 5 1 1 1 1 3 5 5 3 2 5 2 5 1 4 1 4 1 5 5 1 5 2 5
#> [456] 1 1 4 5 1 1 1 5 3 3 5 6 3 1 6 5 5 3 3 3 6 5 3 3 3 5 5 6 6 3 5 2 6 1 5
#> [491] 2 2 6 3 5 1 5 5 4 5 6 5 5 5 5 3 3 5 2 4 1 3 1 5 1 5 1 5 5 2 5 5 6 5 5
#> [526] 5 2 1 5 5 2 5 5 6 2 5 1 5 4 5 5 5 5 1 5 1 4 3 1 2 2 1 1 5 5 1 5 3 5 1
#> [561] 1 3 5 5 5 1 3 5 5 5 1 2 2 5 5 5 5 1 5 6 5 1 5 6 3 6 1 6 5 6 5 5 5 6 5
#> [596] 5 1
#>
#> Within cluster sum of squares by cluster:
#> [1] 76.31263 45.74772 53.34269 32.29354 60.27798 43.71220
#> (between_SS / total_SS = 82.6 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss"
#> [5] "tot.withinss" "betweenss" "size" "iter"
#> [9] "ifault"
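To read the centers in the original units, the attributes stored by scale() can be inverted (a sketch that is not part of the original analysis; remember that monetary was also log10-transformed):
centers <- sweep(k2m$centers, 2, attr(dsData, "scaled:scale"), "*")  # undo standardization
centers <- sweep(centers, 2, attr(dsData, "scaled:center"), "+")
centers[, "monetary"] <- 10^centers[, "monetary"]                    # undo the log10
round(centers, 1)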
par(mfrow = c(1, 2))
image(t(dsData)[, nrow(dsData):1], yaxt = "n", main = "Original Data")
image(t(dsData)[, order(k2m$cluster)], yaxt = "n", main = "Clustered Data")
fviz_cluster provides an elegant ggplot2-based visualization of partitioning results, including those of kmeans [stats package]. Observations are represented by points in the plot, using principal components when ncol(data) > 2, and an ellipse is drawn around each cluster.
require("factoextra")
fviz_cluster(k2m, data = dsData)       # clusters in the scaled feature space used for fitting
fviz_cluster(k2m, data = dsRFM[,2:4])  # the same labels projected on the raw RFM values
colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
options(rgl.printRglwidget = TRUE)
scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency",
          surface.col = colors, axis.scales = FALSE,
          surface = TRUE, # fits a regression surface through the points of each group
          fit = "smooth",
          # ellipsoid = TRUE, # to draw ellipsoids instead, enable this and set surface = FALSE
          grid = TRUE, axis.col = c("black", "black", "black"))
scatter3d(x = dsRFM$frequency, y = dsRFM$monetary, z = dsRFM$recency,
          groups = as.factor(k2m$cluster),
          xlab = "Frequency", ylab = "Monetary Value",
          zlab = "Recency",
          surface.col = colors, axis.scales = FALSE,
          surface = FALSE, # no regression surfaces this time
          fit = "smooth",
          ellipsoid = TRUE, # draw a concentration ellipsoid around each cluster
          grid = TRUE, axis.col = c("black", "black", "black"))
remove(colors)
set.seed(7)
require(kohonen)
som.rfm <- som(dsData, grid = somgrid(2, 3, "hexagonal"))
plot(som.rfm, main = "RFM data")
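Beyond the default codes plot, the fitted kohonen object also exposes the winning unit per observation and the codebook vectors, which is handy for quick diagnostics (a brief sketch):
plot(som.rfm, type = "counts")  # how many customers map to each unit
head(som.rfm$unit.classif)      # winning unit for the first observations
som.rfm$codes                   # codebook vectors (a list in kohonen >= 3.0)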
Using NbClust metrics to determine the number of clusters
require(NbClust)
set.seed(1)
nc <- NbClust(dsData, min.nc=2, max.nc=7, method="kmeans")
#> *** : The Hubert index is a graphical method of determining the number of clusters.
#> In the plot of Hubert index, we seek a significant knee that corresponds to a
#> significant increase of the value of the measure i.e the significant peak in Hubert
#> index second differences plot.
#>
#> *** : The D index is a graphical method of determining the number of clusters.
#> In the plot of D index, we seek a significant knee (the significant peak in Dindex
#> second differences plot) that corresponds to a significant increase of the value of
#> the measure.
#>
#> *******************************************************************
#> * Among all indices:
#> * 4 proposed 2 as the best number of clusters
#> * 7 proposed 3 as the best number of clusters
#> * 7 proposed 4 as the best number of clusters
#> * 1 proposed 6 as the best number of clusters
#> * 4 proposed 7 as the best number of clusters
#>
#> ***** Conclusion *****
#>
#> * According to the majority rule, the best number of clusters is 3
#>
#>
#> *******************************************************************
nc$All.index # estimates for each number of clusters on 26 different metrics of model fit
#> KL CH Hartigan CCC Scott Marriot TrCovW
#> 2 0.3566 397.6290 314.6535 -6.4658 603.9131 145921203 120278.448
#> 3 0.9404 460.5050 280.6405 -2.1031 1447.6081 79899737 54554.612
#> 4 2.9277 544.6780 149.7869 6.5104 2041.7195 52508509 27391.144
#> 5 1.0739 548.2150 129.9202 9.1947 2393.2709 45531252 17216.374
#> 6 1.6119 559.8577 98.1217 11.7591 2770.4403 34857472 10365.859
#> 7 2.6191 559.4129 68.0330 13.2141 3018.6191 31307542 7973.819
#> TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
#> 2 1071.7599 2.5359 1.6683 0.1962 1.2232 0.3692 0.8669 44.0672
#> 3 701.0330 6.5105 2.5505 0.1680 1.0185 0.3904 0.7405 54.6572
#> 4 476.0969 9.5004 3.7555 0.1681 0.8777 0.4110 1.3445 -62.0127
#> 5 380.0895 12.0401 4.7042 0.1799 0.9113 0.3750 1.5664 -91.8474
#> 6 311.6868 15.8727 5.7365 0.1618 0.9032 0.3798 1.7929 -43.3410
#> 7 267.3067 18.3930 6.6889 0.1551 0.9487 0.3508 1.6079 -25.7098
#> Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert
#> 2 0.2606 0.4433 535.8800 0.4041 -0.0789 0.5998 0.0075 0.0007
#> 3 0.5933 0.4472 233.6777 0.5280 0.0205 0.7320 0.0125 0.0011
#> 4 -0.4293 0.4279 119.0242 0.5929 1.4921 0.7691 0.0164 0.0013
#> 5 -0.6117 0.3967 76.0179 0.5313 0.3685 1.0891 0.0180 0.0015
#> 6 -0.7260 0.3708 51.9478 0.5278 0.8379 1.1747 0.0102 0.0015
#> 7 -0.6189 0.3485 38.1867 0.4923 0.9313 1.4190 0.0181 0.0016
#> SDindex Dindex SDbw
#> 2 2.8685 1.1466 1.5492
#> 3 2.3771 0.9293 0.9725
#> 4 2.2695 0.7884 0.6047
#> 5 2.5067 0.6997 0.5928
#> 6 2.5389 0.6427 0.5778
#> 7 3.0266 0.5902 0.4688
barplot(table(nc$Best.n[1,]),
xlab="Number of Clusters", ylab="Number of Criteria",
main="Number of Clusters Chosen by Criteria")
set.seed(123)
fviz_nbclust(dsData, kmeans, method = "wss")  # elbow (total within sum of squares)
# Alternative: average silhouette width
fviz_nbclust(dsData, kmeans, method = "silhouette")
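For intuition, the elbow curve that fviz_nbclust draws is simply the total within-cluster sum of squares across candidate values of k; a hand-rolled sketch of the same quantity:
wss <- sapply(1:7, function(k) kmeans(dsData, centers = k, nstart = 10)$tot.withinss)
plot(1:7, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")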
length(k2m$cluster)
#> [1] 597
length(som.rfm$unit.classif)
#> [1] 597
nrow(dsRFM)
#> [1] 597
dsRFM$clustersom = as.factor(som.rfm$unit.classif)
dsRFM$clusterk=as.factor(k2m$cluster)
kable(head(dsRFM))
| CustomerId | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|
| 10J1001 | 16 | 16 | 18365.70 | 1 | 1 |
| 10J1004 | 24 | 7 | 18362.64 | 3 | 5 |
| 10J1007 | 16 | 44 | 59383.90 | 2 | 4 |
| 10J1008 | 86 | 3 | 9500.52 | 3 | 5 |
| 10J1009 | 158 | 1 | 1437.00 | 5 | 3 |
| 10J1013 | 66 | 2 | 6268.40 | 3 | 5 |
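With both labels attached, a simple cross-tabulation (not part of the original write-up) shows how far the SOM and k-means partitions agree:
table(dsRFM$clustersom, dsRFM$clusterk)  # rows: SOM units; columns: k-means clusters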
The melt function takes data in wide format and stacks a set of columns into a single column of data.
require(reshape2)
#> Loading required package: reshape2
#>
#> Attaching package: 'reshape2'
#> The following object is masked from 'package:tidyr':
#>
#> smiths
dsClusters = melt(data.frame(dsRFM), id=c('CustomerId','clustersom','clusterk'))
kable(head(dsClusters))
| CustomerId | clustersom | clusterk | variable | value |
|---|---|---|---|---|
| 10J1001 | 1 | 1 | recency | 16 |
| 10J1004 | 3 | 5 | recency | 24 |
| 10J1007 | 2 | 4 | recency | 16 |
| 10J1008 | 3 | 5 | recency | 86 |
| 10J1009 | 5 | 3 | recency | 158 |
| 10J1013 | 3 | 5 | recency | 66 |
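Since tidyr is already loaded, the same reshaping can be written with pivot_longer (assuming tidyr >= 1.0), which avoids the extra reshape2 dependency:
dsClusters2 <- dsRFM %>%  # equivalent of the melt() call above
  pivot_longer(cols = c(recency, frequency, monetary),
               names_to = "variable", values_to = "value")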
### Merge
dsMember = read.csv2("D://Datasets//dsBonusRFM.csv")
dsMember = inner_join(dsMember, dsRFM)
#> Joining, by = "CustomerId"
#> Warning: Column `CustomerId` joining factors with different levels,
#> coercing to character vector
kable(head(dsMember))
| NoOfEmployeeGroup | IndustryType | CustomerId | MemberAmount | BonusPotential | PercBonusPotentialSaved | recency | frequency | monetary | clustersom | clusterk |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 to 9 | 1 | 26F2634 | 676.40 | 9910.16 | 6.825319 | 17 | 4 | 2698.92 | 5 | 3 |
| 01 to 9 | 1 | 38M385 | 947.50 | 9910.16 | 9.560895 | 24 | 33 | 26229.88 | 1 | 1 |
| 01 to 9 | 1 | 27F2718 | 1104.75 | 9910.16 | 11.147650 | 66 | 7 | 5026.32 | 3 | 5 |
| 01 to 9 | 1 | 13J1305 | 1317.25 | 9910.16 | 13.291915 | 35 | 24 | 27794.34 | 1 | 1 |
| 01 to 9 | 1 | 27F2709 | 2035.45 | 9910.16 | 20.539023 | 24 | 21 | 35166.64 | 1 | 1 |
| 01 to 9 | 1 | 35M353 | 2666.49 | 9910.16 | 26.906629 | 16 | 26 | 33574.52 | 1 | 1 |
ggplot(dsClusters, aes(factor(clustersom), value, fill=variable)) + geom_bar(stat='identity', position='dodge') +
xlab("Cluster") + ylab('Values')
p1= ggplot(dsMember, aes(x=recency, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="left")
p2= ggplot(dsMember, aes(x=monetary, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="none")
p3= ggplot(dsMember, aes(x=frequency, colour=clusterk, fill=clusterk, group = clusterk)) +
geom_density(alpha=.5) + theme(legend.position="none")
p4= ggplot(dsMember, aes(x=recency, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="left")
p5= ggplot(dsMember, aes(x=monetary, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="none")
p6= ggplot(dsMember, aes(x=frequency, colour=clustersom, fill=clustersom)) + geom_density(alpha=.5) + theme(legend.position="none")
require(ggpubr)
#> Loading required package: ggpubr
#> Loading required package: magrittr
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:tidyr':
#>
#> extract
ggarrange(p1,p4,p2,p5,p3,p6, ncol = 2, nrow = 3)
ggplot(dsMember, aes(x=factor(clustersom), y=PercBonusPotentialSaved)) + geom_bar(stat = "summary", fun.y = "mean")+
xlab("Cluster") + ylab('Percentage saved in \nOffice Supplies') + ggtitle("Bonus Pontential Savings")
Since K-means partitioning is an NP-hard problem, an approximate solution is to use multiple random starts of the algorithm and retain the solution that minimizes the total error sum of squares criterion. A more direct and computationally efficient approach is to first apply Ward's minimum variance agglomerative clustering to the data, identify the partition of the objects into K groups in the dendrogram, and then use that partition as the starting approximation for K-means partitioning³. Ward's method is an agglomerative clustering method based on a classical sum-of-squares criterion, producing groups that minimize within-group dispersion.
hclust performs hierarchical cluster analysis on a set of dissimilarities. The example below uses Euclidean distances (computed with dist) and the ward.D2 method.
d <- dist(dsData)
c <- hclust(d, method = 'ward.D2')
plot(c)
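Drawing rectangles on the dendrogram makes the chosen cut easier to see (rect.hclust is part of base stats and must be called while the dendrogram plot is active):
rect.hclust(c, k = 8, border = "red")  # outline the 8 groups cut below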
# Cut
members <- cutree(c,k = 8)
members[1:5]
#> [1] 1 1 2 3 4
table(members)
#> members
#> 1 2 3 4 5 6 7 8
#> 177 24 157 75 67 9 48 40
aggregate(dsRFM[,2:4], by=list(members), mean)
#> Group.1 recency frequency monetary
#> 1 1 33.98305 12.564972 29146.385
#> 2 2 26.08333 52.541667 75553.262
#> 3 3 49.42675 6.127389 7245.670
#> 4 4 81.08000 1.506667 1038.074
#> 5 5 338.05970 3.223881 4416.419
#> 6 6 398.88889 1.000000 205.660
#> 7 7 189.35417 6.583333 10909.573
#> 8 8 35.25000 29.675000 57229.812
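To close the loop on the Ward-then-k-means idea described above, the group centroids of the hierarchical partition can seed kmeans directly (a sketch; k = 6 is used here only to match the earlier run):
wardStart <- aggregate(dsData, by = list(cutree(c, k = 6)), FUN = mean)[, -1]  # Ward centroids
k2m.ward <- kmeans(dsData, centers = as.matrix(wardStart), iter.max = 500)     # seeded k-means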
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.↩
Wehrens, R. and Buydens, L. M. C. (2007). Self- and super-organizing maps in R: the kohonen package. Journal of Statistical Software, 21(5); Wehrens, R. and Kruisselbrink, J., submitted, 2017.↩
Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274–295.↩