Introduction

This is a walk-through of a customer segmentation process using R’s ‘skmeans’ package to perform k-medians clustering. The dataset examined is that used in chapter 2 of John Foreman’s book, Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html].

In large part, the approach used is that outlined by Foreman. The major difference, of course, is that Foreman implemented his processes in Excel, in support of his teaching objectives. R has the advantages of being more parsimonious, more flexible and easier to automate than manually building a model.

This walk-through includes function calls to calculate silhouette values, measures which can be used to assess the effectiveness of different segmentation (k) values.

R Code

# Load R packages
library(skmeans) # required for the k-medians algorithm
library(cluster) # required for calling the silhouette function
library(dplyr) # for summarizing, ordering and filtering data features
# Read in the sales promotion dataset. Remove meta columns, convert NA values to zeros
kmcDF <- read.csv(".\\wineKMC_matrix.csv") #reads in the data as a dataframe
wineDF <- t(kmcDF[,-c(1,2,3,4,5,6,7)]) # new variable, metadata columns removed, dataframe transposed
wineDF[is.na(wineDF)] <- 0 # replaces blank entries with zeros 
wineMatrix <-as.matrix(wineDF) # converts the dataframe to type matrix

# Segment the customers into 5 clusters
partition <- skmeans(wineMatrix, 5)

# Look at the segmentation outcome
partition # returns a summary statement of the segmentation
## A hard spherical k-means partition of 100 objects into 5 classes.
## Class sizes: 22, 23, 14, 17, 24
## Call: skmeans(x = wineMatrix, k = 5)
# Create a vector of customer names for each cluster. The vector, partition$cluster contains the information on which cluster each customer has been assigned to
cluster_1 <- names(partition$cluster[partition$cluster == 1])
cluster_2 <- names(partition$cluster[partition$cluster == 2])
cluster_3 <- names(partition$cluster[partition$cluster == 3])
cluster_4 <- names(partition$cluster[partition$cluster == 4])
cluster_5 <- names(partition$cluster[partition$cluster == 5])

# Print the customer names for each customer to the screen
cluster_1
##  [1] "Adams"     "Bailey"    "Brown"     "Carter"    "Cruz"     
##  [6] "Diaz"      "Green"     "Hernandez" "Hughes"    "James"    
## [11] "King"      "Lewis"     "Long"      "Murphy"    "Myers"    
## [16] "Perez"     "Rivera"    "Robinson"  "Ross"      "Taylor"   
## [21] "Walker"    "Watson"
cluster_2
##  [1] "Baker"      "Barnes"     "Clark"      "Cooper"     "Gonzalez"  
##  [6] "Harris"     "Kelly"      "Lee"        "Miller"     "Morales"   
## [11] "Nelson"     "Nguyen"     "Parker"     "Price"      "Richardson"
## [16] "Roberts"    "Scott"      "Turner"     "Ward"       "Williams"  
## [21] "Wood"       "Wright"     "Young"
cluster_3
##  [1] "Bennett"   "Brooks"    "Edwards"   "Gutierrez" "Hill"     
##  [6] "Jones"     "Morgan"    "Ortiz"     "Perry"     "Rogers"   
## [11] "Stewart"   "Sullivan"  "Torres"    "Wilson"
cluster_4
##  [1] "Anderson"  "Bell"      "Campbell"  "Cook"      "Cox"      
##  [6] "Flores"    "Gray"      "Jenkins"   "Johnson"   "Mitchell" 
## [11] "Moore"     "Morris"    "Peterson"  "Phillips"  "Rodriguez"
## [16] "Russell"   "Smith"
cluster_5
##  [1] "Allen"    "Butler"   "Collins"  "Davis"    "Evans"    "Fisher"  
##  [7] "Foster"   "Garcia"   "Gomez"    "Hall"     "Howard"   "Jackson" 
## [13] "Lopez"    "Martin"   "Martinez" "Powell"   "Ramirez"  "Reed"    
## [19] "Reyes"    "Sanchez"  "Sanders"  "Thomas"   "Thompson" "White"
# Examine characteristics of each cluster using the aggregation function to sum the number of purchases for each promotion by cluster.
clusterCounts <- t(aggregate(wineDF, by=list(partition$cluster), sum)[,2:33]) # based on Data Smart, Ch 10
clusterCounts <- cbind(kmcDF[,c(1:7)], clusterCounts) # add back the meta data columns

# The arrange function in the dplyr package is used to view the characteristics of the different clusters. Results are arranged by cluster and in descending order of number of purchases 
head(arrange(clusterCounts, -clusterCounts$"1"), 7)
##   Offer.. Campaign     Varietal Minimum.Qty..kg. Discount....       Origin
## 1       7    March     Prosecco                6           40    Australia
## 2      29 November Pinot Grigio                6           87       France
## 3      30 December       Malbec                6           54       France
## 4      18     July    Espumante                6           50       Oregon
## 5       8    March    Espumante                6           45 South Africa
## 6      13      May       Merlot                6           43        Chile
## 7       3 February    Espumante              144           32       Oregon
##   Past.Peak  1 2  3 4 5
## 1      TRUE 15 3  1 0 0
## 2     FALSE 12 1  4 0 0
## 3     FALSE 12 1  4 0 5
## 4     FALSE  9 1  4 0 0
## 5     FALSE  4 2 14 0 0
## 6     FALSE  4 0  2 0 0
## 7      TRUE  1 3  2 0 0
head(arrange(clusterCounts, -clusterCounts$"2"), 7)
##   Offer.. Campaign  Varietal Minimum.Qty..kg. Discount....      Origin
## 1      31 December Champagne               72           89      France
## 2       6    March  Prosecco              144           86       Chile
## 3      22   August Champagne               72           63      France
## 4       4 February Champagne               72           48      France
## 5      10    April  Prosecco               72           52  California
## 6      19     July Champagne               12           66     Germany
## 7      27  October Champagne               72           88 New Zealand
##   Past.Peak 1  2 3 4  5
## 1     FALSE 0 16 0 0  1
## 2     FALSE 0 10 1 0  1
## 3     FALSE 0  9 2 0 10
## 4      TRUE 0  8 0 0  4
## 5     FALSE 1  4 1 1  0
## 6     FALSE 0  4 1 0  0
## 7     FALSE 0  4 1 1  3
head(arrange(clusterCounts, -clusterCounts$"3"), 7)
##   Offer.. Campaign     Varietal Minimum.Qty..kg. Discount....       Origin
## 1       8    March    Espumante                6           45 South Africa
## 2      18     July    Espumante                6           50       Oregon
## 3      29 November Pinot Grigio                6           87       France
## 4      30 December       Malbec                6           54       France
## 5       3 February    Espumante              144           32       Oregon
## 6      13      May       Merlot                6           43        Chile
## 7      22   August    Champagne               72           63       France
##   Past.Peak  1 2  3 4  5
## 1     FALSE  4 2 14 0  0
## 2     FALSE  9 1  4 0  0
## 3     FALSE 12 1  4 0  0
## 4     FALSE 12 1  4 0  5
## 5      TRUE  1 3  2 0  0
## 6     FALSE  4 0  2 0  0
## 7     FALSE  0 9  2 0 10
head(arrange(clusterCounts, -clusterCounts$"4"), 7)
##   Offer..  Campaign   Varietal Minimum.Qty..kg. Discount....     Origin
## 1      24 September Pinot Noir                6           34      Italy
## 2      26   October Pinot Noir              144           83  Australia
## 3       2   January Pinot Noir               72           17     France
## 4      17      July Pinot Noir               12           47    Germany
## 5       1   January     Malbec               72           56     France
## 6      10     April   Prosecco               72           52 California
## 7      12       May   Prosecco               72           83  Australia
##   Past.Peak 1 2 3  4 5
## 1     FALSE 0 0 0 12 0
## 2     FALSE 0 0 0 12 3
## 3     FALSE 0 0 0  7 3
## 4     FALSE 0 0 0  7 0
## 5     FALSE 0 2 0  2 6
## 6     FALSE 1 4 1  1 0
## 7     FALSE 1 0 0  1 3
head(arrange(clusterCounts, -clusterCounts$"5"), 7)
##   Offer.. Campaign           Varietal Minimum.Qty..kg. Discount.... Origin
## 1      11      May          Champagne               72           85 France
## 2      22   August          Champagne               72           63 France
## 3       9    April         Chardonnay              144           57  Chile
## 4       1  January             Malbec               72           56 France
## 5      14     June             Merlot               72           64  Chile
## 6      25  October Cabernet Sauvignon               72           59 Oregon
## 7      30 December             Malbec                6           54 France
##   Past.Peak  1 2 3 4  5
## 1     FALSE  0 2 1 0 10
## 2     FALSE  0 9 2 0 10
## 3     FALSE  0 2 0 0  8
## 4     FALSE  0 2 0 2  6
## 5     FALSE  0 3 0 0  6
## 6      TRUE  0 0 0 0  6
## 7     FALSE 12 1 4 0  5
# Compare performance of the k=5 clustering process with the k=4 clustering using the silhouette function. The closer the value is to 1, the better.

silhouette_k5 <- silhouette(partition)
summary(silhouette_k5)  
## Silhouette of 100 units in 5 clusters from silhouette.skmeans(x = partition) :
##  Cluster sizes and average silhouette widths:
##         22         23         14         17         24 
## 0.24891252 0.14413204 0.34543622 0.47196893 0.07666648 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.09038  0.08616  0.16780  0.23490  0.38650  0.65730
plot(silhouette_k5)

partition_k4 <- skmeans(wineMatrix, 4)
silhouette_k4 <- silhouette(partition_k4)  
summary(silhouette_k4)   
## Silhouette of 100 units in 4 clusters from silhouette.skmeans(x = partition_k4) :
##  Cluster sizes and average silhouette widths:
##         25         26         16         33 
## 0.14182338 0.07482956 0.52008106 0.32830368 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.06625  0.10070  0.17990  0.24650  0.38540  0.67670
plot(silhouette_k4)

Typical silhouette values for repeated runs with k=5 and k=4 were about 0.23 and 0.24 respectively so we conclude that the k=5 and k=4 segmentation processes perform about the same.