This first example shows how to perform cluster analysis with R. The rattle package is loaded in order to use its wine data set.
# install.packages('rattle')
data(wine, package='rattle')
head(wine)
## Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
In this data set we observe the composition of different wines. Given a set of observations \((x_1, x_2, \ldots, x_n)\), where each observation is a \(d\)-dimensional real vector, \(k\)-means clustering aims to partition the \(n\) observations into \(k \leq n\) sets \(S = \{S_1, S_2, \ldots, S_k\}\) so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:
\[ \underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2 \]
where \(\boldsymbol\mu_i\) is the mean of the points in \(S_i\). In R, this optimization problem is solved with the function kmeans.
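As a quick sanity check of this objective (a toy sketch on random data, independent of the wine analysis below; x and fit are throwaway names used only here), the value that kmeans reports as tot.withinss is exactly the WCSS defined above:
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)               # 50 toy points in 2 dimensions
fit <- kmeans(x, centers = 3)
wcss <- sum((x - fit$centers[fit$cluster, ])^2) # squared distances to assigned centres
all.equal(wcss, fit$tot.withinss)               # TRUE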
wine.stand <- scale(wine[-1]) # standardize the variables (drop the Type column)
# K-Means
k.means.fit <- kmeans(wine.stand, 3) # k = 3
The object k.means.fit contains all the elements of the clustering output:
attributes(k.means.fit)
## $names
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
##
## $class
## [1] "kmeans"
# Centroids:
k.means.fit$centers
## Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 0.1644 0.8691 0.1864 0.5229 -0.07526 -0.97658 -1.21183
## 2 0.8329 -0.3030 0.3637 -0.6085 0.57596 0.88275 0.97507
## 3 -0.9235 -0.3929 -0.4931 0.1701 -0.49033 -0.07577 0.02075
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.72402 -0.7775 0.9389 -1.1615 -1.2888 -0.4059
## 2 -0.56051 0.5787 0.1706 0.4727 0.7771 1.1220
## 3 -0.03344 0.0581 -0.8994 0.4605 0.2700 -0.7517
# Clusters:
k.means.fit$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3
## [71] 3 3 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1
# Cluster size:
k.means.fit$size
## [1] 51 62 65
A fundamental question is how to determine the value of the parameter \(k\). One approach is to look at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster does not give much better modelling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.
wssplot <- function(data, nc=15, seed=1234){
  # within-groups sum of squares for k = 1 is (n - 1) * sum of the column variances
  wss <- (nrow(data)-1)*sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
wssplot(wine.stand, nc=6)
Which is the optimal value for k in this case? Why?
The library cluster allows us to represent (with the aid of PCA) the cluster solution in two dimensions:
library(cluster)
clusplot(wine.stand, k.means.fit$cluster, main='2D representation of the Cluster solution',
color=TRUE, shade=TRUE,
labels=2, lines=0)
In order to evaluate the clustering performance we build a confusion matrix:
table(wine[,1],k.means.fit$cluster)
##
## 1 2 3
## 1 0 59 0
## 2 3 3 65
## 3 48 0 0
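A single summary figure can be read off this matrix by matching each cluster to the wine type it contains most often (a rough sketch, not a formal validation index):
conf <- table(wine[,1], k.means.fit$cluster)
sum(apply(conf, 2, max)) / sum(conf)  # proportion of wines grouped with their cluster's majority type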
Hierarchical methods use a distance matrix as an input for the clustering algorithm. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.
d <- dist(wine.stand, method = "euclidean") # Euclidean distance matrix.
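As noted above, the choice of metric matters; dist() also supports other methods, for example (variants shown for illustration only, the analysis below keeps the Euclidean distance):
d.manhattan <- dist(wine.stand, method = "manhattan")
d.maximum   <- dist(wine.stand, method = "maximum")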
We use the Euclidean distance as an input for the clustering algorithm (Ward’s minimum variance criterion minimizes the total within-cluster variance):
H.fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D"; "ward.D2" is also available
The clustering output can be displayed in a dendrogram:
plot(H.fit) # display dendrogram
groups <- cutree(H.fit, k=3) # cut tree into 3 clusters
# draw dendrogram with red borders around the 3 clusters
rect.hclust(H.fit, k=3, border="red")
The clustering performance can be evaluated with the aid of a confusion matrix as follows:
table(wine[,1],groups)
## groups
## 1 2 3
## 1 58 1 0
## 2 7 58 6
## 3 0 0 48
We consider 25 European countries (n = 25 units) and their protein intakes (in percent) from nine major food sources (p = 9). The data are listed below.
url = 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/protein.csv'
food <- read.csv(url)
head(food)
## Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts
## 1 Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5
## 2 Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3
## 3 Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1
## 4 Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7
## 5 Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1
## 6 Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7
## Fr.Veg
## 1 1.7
## 2 4.3
## 3 4.0
## 4 4.2
## 5 4.0
## 6 2.4
We start by clustering on just red and white meat (p = 2) with k = 3 clusters.
set.seed(123456789) ## to fix the random starting clusters
grpMeat <- kmeans(food[,c("WhiteMeat","RedMeat")], centers=3, nstart=10)
grpMeat
## K-means clustering with 3 clusters of sizes 8, 12, 5
##
## Cluster means:
## WhiteMeat RedMeat
## 1 12.062 8.838
## 2 4.658 8.258
## 3 9.000 15.180
##
## Clustering vector:
## [1] 2 1 3 2 1 1 1 2 3 2 1 3 2 1 2 1 2 2 2 2 3 3 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 39.46 69.86 35.67
## (between_SS / total_SS = 75.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
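The "(between_SS / total_SS = 75.7 %)" figure printed above can be recomputed directly from the returned components:
grpMeat$betweenss / grpMeat$totss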
## list of cluster assignments
o=order(grpMeat$cluster)
data.frame(food$Country[o],grpMeat$cluster[o])
## food.Country.o. grpMeat.cluster.o.
## 1 Austria 1
## 2 Czechoslovakia 1
## 3 Denmark 1
## 4 E Germany 1
## 5 Hungary 1
## 6 Netherlands 1
## 7 Poland 1
## 8 W Germany 1
## 9 Albania 2
## 10 Bulgaria 2
## 11 Finland 2
## 12 Greece 2
## 13 Italy 2
## 14 Norway 2
## 15 Portugal 2
## 16 Romania 2
## 17 Spain 2
## 18 Sweden 2
## 19 USSR 2
## 20 Yugoslavia 2
## 21 Belgium 3
## 22 France 3
## 23 Ireland 3
## 24 Switzerland 3
## 25 UK 3
To see a graphical representation of the clustering solution, we plot the cluster assignments on a scatter plot of red meat against white meat:
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat", ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country, col=grpMeat$cluster+1)
Next, we cluster on all nine protein groups and set the number of clusters to seven. The resulting clusters, shown in color on a scatter plot of white meat against red meat (any other pair of features could be selected; the plot is reproduced after the cluster listing below), actually make a lot of sense: countries in close geographic proximity tend to be clustered into the same group.
## Same analysis, but now clustering on all protein groups;
## change the number of clusters to 7
set.seed(123456789)
grpProtein <- kmeans(food[,-1], centers=7, nstart=10)
o=order(grpProtein$cluster)
data.frame(food$Country[o],grpProtein$cluster[o])
## food.Country.o. grpProtein.cluster.o.
## 1 Austria 1
## 2 E Germany 1
## 3 Netherlands 1
## 4 W Germany 1
## 5 Portugal 2
## 6 Spain 2
## 7 Albania 3
## 8 Greece 3
## 9 Italy 3
## 10 USSR 3
## 11 Czechoslovakia 4
## 12 Hungary 4
## 13 Poland 4
## 14 Bulgaria 5
## 15 Romania 5
## 16 Yugoslavia 5
## 17 Denmark 6
## 18 Finland 6
## 19 Norway 6
## 20 Sweden 6
## 21 Belgium 7
## 22 France 7
## 23 Ireland 7
## 24 Switzerland 7
## 25 UK 7
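The scatter plot described above can be reproduced in the same way as before, this time coloring by the seven-cluster assignment (a sketch mirroring the earlier plot):
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat", ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country, col=grpProtein$cluster+1)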
library(cluster)
clusplot(food[,-1], grpProtein$cluster, main='2D representation of the Cluster solution', color=TRUE, shade=TRUE, labels=2, lines=0)
Alternatively we can implement a hierarchical approach. We use the agnes function in the package cluster. Argument diss=FALSE indicates that the dissimilarities should be computed from the raw data rather than supplied as a precomputed matrix. Argument metric="euclidean" indicates that we use Euclidean distance. No standardization is used, and the link function is the "average" linkage. We cluster on the nine protein variables only, excluding the Country column.
foodagg <- agnes(food[,-1], diss=FALSE, metric="euclidean") # exclude the Country column
plot(foodagg, which.plots=2, main='Dendrogram') ## dendrogram
groups <- cutree(foodagg, k=4) # cut tree into 4 clusters
rect.hclust(foodagg, k=4, border="red")
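If standardization of the protein variables is preferred, agnes can also do it internally (a variant of the call above, shown for completeness and not used further here):
foodagg.std <- agnes(food[,-1], diss=FALSE, metric="euclidean", stand=TRUE)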
Customer segmentation is as simple as it sounds: grouping customers by their characteristics - and why would you want to do that? To better serve their needs!
Our example has to do with e-mail marketing: we use a data set of wine offers (offers.csv) together with the customers' transaction history (transactions.csv).
offers<-read.table('offers.csv', sep = ';', header=T)
head(offers)
## OfferID Campaign Varietal MinimumQt Discount Origin
## 1 1 January Malbec 72 56 France
## 2 2 January Pinot Noir 72 17 France
## 3 3 February Espumante 144 32 Oregon
## 4 4 February Champagne 72 48 France
## 5 5 February Cabernet Sauvignon 144 44 New Zealand
## 6 6 March Prosecco 144 86 Chile
## PastPeak
## 1 FALSE
## 2 FALSE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 FALSE
transactions<-read.table('transactions.csv', sep = ';', header=T)
head(transactions)
## CustomerLastName OfferID
## 1 Smith 2
## 2 Smith 24
## 3 Johnson 17
## 4 Johnson 24
## 5 Johnson 26
## 6 Williams 18
We have two data sets: one for the offers and the other for the transactions. The first thing we need to do is create a transaction matrix; that means we need to put the offers we mailed out next to the transaction history of each customer. This is easily achieved with a pivot table.
# Create transaction matrix (a pivot table like in Excel way!)
library(reshape)
pivot<-melt(transactions[1:2])
## Using CustomerLastName as id variables
pivot<-cast(pivot,value~CustomerLastName,fill=0,fun.aggregate=function(x) length(x)) # one row per offer, one column per customer, cells count purchases
pivot<-cbind(offers,pivot[-1]) # attach the offer details
# write.csv(file="pivot.csv",pivot) # to save your data
cluster.data<-pivot[,8:length(pivot)] # keep only the customer columns (columns 8 onwards)
cluster.data<-t(cluster.data) # transpose: customers in rows, offers in columns
head(cluster.data)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## Adams 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## Allen 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Anderson 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## Bailey 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Baker 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## Barnes 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
## 26 27 28 29 30 31 32
## Adams 0 0 0 1 1 0 0
## Allen 0 1 0 0 0 0 0
## Anderson 1 0 0 0 0 0 0
## Bailey 0 0 0 0 1 0 0
## Baker 0 0 0 0 0 1 0
## Barnes 0 0 0 0 0 1 0
In the clustering data set, rows represent customers and columns correspond to the different wine offers.
We will use \(k = 4\) clusters. This is somewhat arbitrary, but the number you pick should be representative of the number of segments you can handle as a business; 100 segments, for example, does not make sense for an e-mail marketing campaign.
We need to calculate how far away each customer is from the cluster’s mean. To do this we could use one of many distance/dissimilarity indices, one of which is the Gower dissimilarity.
library(cluster)
D=daisy(cluster.data, metric='gower')
## Warning: binary variable(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
## 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
## treated as interval scaled
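The warning notes that the 0/1 purchase indicators are treated as interval-scaled. Since the data are binary, a Jaccard-type dissimilarity is another natural option (a variant, shown for illustration and not used in what follows):
D.bin <- dist(cluster.data, method = "binary") # asymmetric binary (Jaccard) distance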
After creating the distance matrix, we apply Ward’s hierarchical clustering procedure:
H.fit <- hclust(D, method="ward.D") # "ward" was renamed to "ward.D"; "ward.D2" is also available
plot(H.fit) # display dendrogram
groups <- cutree(H.fit, k=4) # cut tree into 4 clusters
# draw dendrogram with red borders around the 4 clusters
rect.hclust(H.fit, k=4, border="red")
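As a rough check of the (somewhat arbitrary) choice of k = 4 made earlier, we can look at the average silhouette width of the four-group cut, based on the same Gower dissimilarities (a sketch using the silhouette function from the cluster package):
sil <- silhouette(groups, D) # silhouette widths of the 4-cluster solution
summary(sil)$avg.width       # average silhouette width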
# 2D representation of the Segmentation:
clusplot(cluster.data, groups, color=TRUE, shade=TRUE,
labels=2, lines=0, main= 'Customer segments')
To get the top deals we will have to do a little bit of data manipulation. First we need to combine our clusters and transactions. Notably, the lengths of the ‘tables’ holding transactions and clusters are different, so we need a way to merge the data: we use the merge() function and give our columns sensible names:
# Merge Data
cluster.deals<-merge(transactions[1:2],groups,by.x = "CustomerLastName", by.y = "row.names")
colnames(cluster.deals)<-c("Name","Offer","Cluster")
head(cluster.deals)
## Name Offer Cluster
## 1 Adams 18 1
## 2 Adams 29 1
## 3 Adams 30 1
## 4 Allen 9 2
## 5 Allen 27 2
## 6 Anderson 24 3
We then repeat the pivoting process to get offers in rows and clusters in columns, counting the total number of transactions for each cluster. Once we have our pivot table, we merge it with the offers data table like we did before:
# Get top deals by cluster
cluster.pivot<-melt(cluster.deals,id=c("Offer","Cluster"))
cluster.pivot<-cast(cluster.pivot,Offer~Cluster,fun.aggregate=length)
cluster.topDeals<-cbind(offers,cluster.pivot[-1])
head(cluster.topDeals)
## OfferID Campaign Varietal MinimumQt Discount Origin
## 1 1 January Malbec 72 56 France
## 2 2 January Pinot Noir 72 17 France
## 3 3 February Espumante 144 32 Oregon
## 4 4 February Champagne 72 48 France
## 5 5 February Cabernet Sauvignon 144 44 New Zealand
## 6 6 March Prosecco 144 86 Chile
## PastPeak 1 2 3 4
## 1 FALSE 0 8 2 0
## 2 FALSE 0 3 7 0
## 3 TRUE 1 2 0 3
## 4 TRUE 0 8 0 4
## 5 TRUE 0 4 0 0
## 6 FALSE 1 5 0 6
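The counts in the last four columns are all we need to pick the top deals per segment. For instance, the offers most popular within cluster 4 can be listed by sorting on that cluster’s count column (a small sketch, assuming the last column holds the counts for cluster 4 as in the printout above):
counts4 <- cluster.topDeals[, ncol(cluster.topDeals)] # transaction counts for cluster 4
head(cluster.topDeals[order(counts4, decreasing = TRUE), ])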
Finally, we can export the data in CSV format (which opens directly in Excel) with the command:
# write.csv(file="topdeals.csv",cluster.topDeals,row.names=F)