This is a walk-through of a customer segmentation process using R’s ‘skmeans’ package to perform k-medians clustering. The dataset examined is that used in chapter 2 of John Foreman’s book, Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html].
In large part, the approach used is that outlined by Foreman. The major difference, of course, is that Foreman implemented his processes in Excel, in support of his teaching objectives. R has the advantages of being more parsimonious, more flexible and easier to automate than manually building a model.
This walk-through includes function calls to calculate silhouette values, measures which can be used to assess the effectiveness of different segmentation (k) values.
# Load R packages
library(skmeans) # required for the k-medians algorithm
library(cluster) # required for calling the silhouette function
library(dplyr) # for summarizing, ordering and filtering data features
# Read in the sales promotion dataset. Remove meta columns, convert NA values to zeros
kmcDF <- read.csv(".\\wineKMC_matrix.csv") #reads in the data as a dataframe
wineDF <- t(kmcDF[,-c(1,2,3,4,5,6,7)]) # new variable, metadata columns removed, dataframe transposed
wineDF[is.na(wineDF)] <- 0 # replaces blank entries with zeros
wineMatrix <-as.matrix(wineDF) # converts the dataframe to type matrix
# Segment the customers into 5 clusters
partition <- skmeans(wineMatrix, 5)
# Look at the segmentation outcome
partition # returns a summary statement of the segmentation
## A hard spherical k-means partition of 100 objects into 5 classes.
## Class sizes: 22, 23, 14, 17, 24
## Call: skmeans(x = wineMatrix, k = 5)
# Create a vector of customer names for each cluster. The vector, partition$cluster contains the information on which cluster each customer has been assigned to
cluster_1 <- names(partition$cluster[partition$cluster == 1])
cluster_2 <- names(partition$cluster[partition$cluster == 2])
cluster_3 <- names(partition$cluster[partition$cluster == 3])
cluster_4 <- names(partition$cluster[partition$cluster == 4])
cluster_5 <- names(partition$cluster[partition$cluster == 5])
# Print the customer names for each customer to the screen
cluster_1
## [1] "Adams" "Bailey" "Brown" "Carter" "Cruz"
## [6] "Diaz" "Green" "Hernandez" "Hughes" "James"
## [11] "King" "Lewis" "Long" "Murphy" "Myers"
## [16] "Perez" "Rivera" "Robinson" "Ross" "Taylor"
## [21] "Walker" "Watson"
cluster_2
## [1] "Baker" "Barnes" "Clark" "Cooper" "Gonzalez"
## [6] "Harris" "Kelly" "Lee" "Miller" "Morales"
## [11] "Nelson" "Nguyen" "Parker" "Price" "Richardson"
## [16] "Roberts" "Scott" "Turner" "Ward" "Williams"
## [21] "Wood" "Wright" "Young"
cluster_3
## [1] "Bennett" "Brooks" "Edwards" "Gutierrez" "Hill"
## [6] "Jones" "Morgan" "Ortiz" "Perry" "Rogers"
## [11] "Stewart" "Sullivan" "Torres" "Wilson"
cluster_4
## [1] "Anderson" "Bell" "Campbell" "Cook" "Cox"
## [6] "Flores" "Gray" "Jenkins" "Johnson" "Mitchell"
## [11] "Moore" "Morris" "Peterson" "Phillips" "Rodriguez"
## [16] "Russell" "Smith"
cluster_5
## [1] "Allen" "Butler" "Collins" "Davis" "Evans" "Fisher"
## [7] "Foster" "Garcia" "Gomez" "Hall" "Howard" "Jackson"
## [13] "Lopez" "Martin" "Martinez" "Powell" "Ramirez" "Reed"
## [19] "Reyes" "Sanchez" "Sanders" "Thomas" "Thompson" "White"
# Examine characteristics of each cluster using the aggregation function to sum the number of purchases for each promotion by cluster.
clusterCounts <- t(aggregate(wineDF, by=list(partition$cluster), sum)[,2:33]) # based on Data Smart, Ch 10
clusterCounts <- cbind(kmcDF[,c(1:7)], clusterCounts) # add back the meta data columns
# The arrange function in the dplyr package is used to view the characteristics of the different clusters. Results are arranged by cluster and in descending order of number of purchases
head(arrange(clusterCounts, -clusterCounts$"1"), 7)
## Offer.. Campaign Varietal Minimum.Qty..kg. Discount.... Origin
## 1 7 March Prosecco 6 40 Australia
## 2 29 November Pinot Grigio 6 87 France
## 3 30 December Malbec 6 54 France
## 4 18 July Espumante 6 50 Oregon
## 5 8 March Espumante 6 45 South Africa
## 6 13 May Merlot 6 43 Chile
## 7 3 February Espumante 144 32 Oregon
## Past.Peak 1 2 3 4 5
## 1 TRUE 15 3 1 0 0
## 2 FALSE 12 1 4 0 0
## 3 FALSE 12 1 4 0 5
## 4 FALSE 9 1 4 0 0
## 5 FALSE 4 2 14 0 0
## 6 FALSE 4 0 2 0 0
## 7 TRUE 1 3 2 0 0
head(arrange(clusterCounts, -clusterCounts$"2"), 7)
## Offer.. Campaign Varietal Minimum.Qty..kg. Discount.... Origin
## 1 31 December Champagne 72 89 France
## 2 6 March Prosecco 144 86 Chile
## 3 22 August Champagne 72 63 France
## 4 4 February Champagne 72 48 France
## 5 10 April Prosecco 72 52 California
## 6 19 July Champagne 12 66 Germany
## 7 27 October Champagne 72 88 New Zealand
## Past.Peak 1 2 3 4 5
## 1 FALSE 0 16 0 0 1
## 2 FALSE 0 10 1 0 1
## 3 FALSE 0 9 2 0 10
## 4 TRUE 0 8 0 0 4
## 5 FALSE 1 4 1 1 0
## 6 FALSE 0 4 1 0 0
## 7 FALSE 0 4 1 1 3
head(arrange(clusterCounts, -clusterCounts$"3"), 7)
## Offer.. Campaign Varietal Minimum.Qty..kg. Discount.... Origin
## 1 8 March Espumante 6 45 South Africa
## 2 18 July Espumante 6 50 Oregon
## 3 29 November Pinot Grigio 6 87 France
## 4 30 December Malbec 6 54 France
## 5 3 February Espumante 144 32 Oregon
## 6 13 May Merlot 6 43 Chile
## 7 22 August Champagne 72 63 France
## Past.Peak 1 2 3 4 5
## 1 FALSE 4 2 14 0 0
## 2 FALSE 9 1 4 0 0
## 3 FALSE 12 1 4 0 0
## 4 FALSE 12 1 4 0 5
## 5 TRUE 1 3 2 0 0
## 6 FALSE 4 0 2 0 0
## 7 FALSE 0 9 2 0 10
head(arrange(clusterCounts, -clusterCounts$"4"), 7)
## Offer.. Campaign Varietal Minimum.Qty..kg. Discount.... Origin
## 1 24 September Pinot Noir 6 34 Italy
## 2 26 October Pinot Noir 144 83 Australia
## 3 2 January Pinot Noir 72 17 France
## 4 17 July Pinot Noir 12 47 Germany
## 5 1 January Malbec 72 56 France
## 6 10 April Prosecco 72 52 California
## 7 12 May Prosecco 72 83 Australia
## Past.Peak 1 2 3 4 5
## 1 FALSE 0 0 0 12 0
## 2 FALSE 0 0 0 12 3
## 3 FALSE 0 0 0 7 3
## 4 FALSE 0 0 0 7 0
## 5 FALSE 0 2 0 2 6
## 6 FALSE 1 4 1 1 0
## 7 FALSE 1 0 0 1 3
head(arrange(clusterCounts, -clusterCounts$"5"), 7)
## Offer.. Campaign Varietal Minimum.Qty..kg. Discount.... Origin
## 1 11 May Champagne 72 85 France
## 2 22 August Champagne 72 63 France
## 3 9 April Chardonnay 144 57 Chile
## 4 1 January Malbec 72 56 France
## 5 14 June Merlot 72 64 Chile
## 6 25 October Cabernet Sauvignon 72 59 Oregon
## 7 30 December Malbec 6 54 France
## Past.Peak 1 2 3 4 5
## 1 FALSE 0 2 1 0 10
## 2 FALSE 0 9 2 0 10
## 3 FALSE 0 2 0 0 8
## 4 FALSE 0 2 0 2 6
## 5 FALSE 0 3 0 0 6
## 6 TRUE 0 0 0 0 6
## 7 FALSE 12 1 4 0 5
# Compare performance of the k=5 clustering process with the k=4 clustering using the silhouette function. The closer the value is to 1, the better.
silhouette_k5 <- silhouette(partition)
summary(silhouette_k5)
## Silhouette of 100 units in 5 clusters from silhouette.skmeans(x = partition) :
## Cluster sizes and average silhouette widths:
## 22 23 14 17 24
## 0.24891252 0.14413204 0.34543622 0.47196893 0.07666648
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.09038 0.08616 0.16780 0.23490 0.38650 0.65730
plot(silhouette_k5)
partition_k4 <- skmeans(wineMatrix, 4)
silhouette_k4 <- silhouette(partition_k4)
summary(silhouette_k4)
## Silhouette of 100 units in 4 clusters from silhouette.skmeans(x = partition_k4) :
## Cluster sizes and average silhouette widths:
## 25 26 16 33
## 0.14182338 0.07482956 0.52008106 0.32830368
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06625 0.10070 0.17990 0.24650 0.38540 0.67670
plot(silhouette_k4)
Typical silhouette values for repeated runs with k=5 and k=4 were about 0.23 and 0.24 respectively so we conclude that the k=5 and k=4 segmentation processes perform about the same.