The objective is to draw insights from the Wholesale Customers data set located here: https://archive.ics.uci.edu/ml/datasets/wholesale+customers#
This project will be guided by examples in the “Machine Learning in R Cookbook”.
We’ll use k-means clustering and hierarchical cluster analysis (HCA).
The first step is to load the data into R.
customers <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"), header = TRUE, sep = ",")
head(customers)
The attributes, taken from the dataset page, are annual spending (in monetary units) on Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen products, plus two categorical variables: Channel (1 = Horeca, 2 = Retail) and Region (1 = Lisbon, 2 = Oporto, 3 = Other).
Some descriptive stats…
summary(customers)
Channel Region Fresh Milk
Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
Median :1.000 Median :3.000 Median : 8504 Median : 3627
Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
Grocery Frozen Detergents_Paper
Min. : 3 Min. : 25.0 Min. : 3.0
1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8
Median : 4756 Median : 1526.0 Median : 816.5
Mean : 7951 Mean : 3071.9 Mean : 2881.5
3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0
Max. :92780 Max. :60869.0 Max. :40827.0
Delicassen
Min. : 3.0
1st Qu.: 408.2
Median : 965.5
Mean : 1524.9
3rd Qu.: 1820.2
Max. :47943.0
table(customers$Channel)
1 2
298 142
table(customers$Region)
1 2 3
77 47 316
sum(is.na(customers))
[1] 0
Because the Region and Channel variables are categorical (numeric codes rather than quantities), Euclidean distance on them isn’t meaningful, so let’s remove them from the clustering for now; they don’t directly measure spending habits. We’ll also convert Channel and Region to factors, in case we want to use them later in the analysis.
customers$Region <- as.factor(customers$Region)
customers$Channel <- as.factor(customers$Channel)
newCust <- customers
newCust$Region <- NULL
newCust$Channel <- NULL
Let’s also normalize the data.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
newCust[1:6] <- as.data.frame(lapply(newCust[1:6], normalize))
summary(newCust)
Fresh Milk Grocery
Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.02786 1st Qu.:0.02012 1st Qu.:0.02317
Median :0.07580 Median :0.04864 Median :0.05122
Mean :0.10698 Mean :0.07817 Mean :0.08567
3rd Qu.:0.15097 3rd Qu.:0.09715 3rd Qu.:0.11482
Max. :1.00000 Max. :1.00000 Max. :1.00000
Frozen Detergents_Paper Delicassen
Min. :0.00000 Min. :0.000000 Min. :0.000000
1st Qu.:0.01179 1st Qu.:0.006216 1st Qu.:0.008453
Median :0.02467 Median :0.019927 Median :0.020077
Mean :0.05008 Mean :0.070510 Mean :0.031745
3rd Qu.:0.05800 3rd Qu.:0.095997 3rd Qu.:0.037907
Max. :1.00000 Max. :1.000000 Max. :1.000000
As seen above, all of the values now fall between 0 and 1, so each variable carries roughly equal weight in the distance calculations; i.e., spending on every product is put on the same scale, which helps show which products should be marketed to which customers. We’ll also build clusters without normalizing the data later in this project.
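Min-max scaling is just one option; as an aside, here’s a minimal sketch (not used in the rest of this analysis) of z-score standardization with base R’s scale(), which centers each variable at 0 with unit variance.
# A hedged alternative to min-max scaling: z-score standardization.
# Shown for illustration only; the analysis below keeps the min-max version.
scaledCust <- as.data.frame(scale(customers[, c("Fresh", "Milk", "Grocery",
                                                "Frozen", "Detergents_Paper",
                                                "Delicassen")]))
summary(scaledCust)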
Let’s find the optimal number of clusters for this data set.
clus <- 2:12
set.seed(22)
WSS <- sapply(clus, function(k) {
kmeans(newCust, centers = k)$tot.withinss
})
WSS
[1] 17.245406 13.895508 10.766948 9.011122 7.964424 7.140254
[7] 6.938415 6.204416 5.666592 5.760167 4.842321
plot(clus, WSS, type = "l", xlab = "number of clusters k", ylab = "within sum of squares")
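One caveat worth noting: kmeans() starts from random centroids, so the WSS values above can shift a bit from run to run. Here is a small sketch (not part of the original run) that repeats each fit with nstart = 25 and keeps the best of those starts, which makes the elbow curve more stable.
set.seed(22)
# Re-run the elbow calculation with multiple random starts per k.
WSS_stable <- sapply(clus, function(k) {
  kmeans(newCust, centers = k, nstart = 25)$tot.withinss
})
plot(clus, WSS_stable, type = "b",
     xlab = "number of clusters k", ylab = "within sum of squares")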
Next, calculate the average silhouette width for various numbers of clusters.
library(fpc)
sw = sapply(clus, function(k) {
cluster.stats(dist(newCust), kmeans(newCust, centers = k)$cluster)$avg.silwidth
})
sw
[1] 0.5729735 0.4405736 0.3989539 0.3906469 0.3498627 0.3413289
[7] 0.3321986 0.2906882 0.2714457 0.2124369 0.2641221
plot(clus, sw, type = "l", xlab = "number of clusters", ylab = "average silhouette width")
The graph above indicates that 2 clusters would be optimal.
clus[which.max(sw)]
[1] 2
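As a sanity check, the average silhouette width for k = 2 can also be computed directly with the cluster package; this sketch should give a value close to the first entry of sw above (output not shown).
library(cluster)
set.seed(22)
km2 <- kmeans(newCust, centers = 2)
# silhouette() takes the cluster assignments and a distance matrix.
sil <- silhouette(km2$cluster, dist(newCust))
mean(sil[, "sil_width"])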
Let’s create the two clusters.
set.seed(22)
fit = kmeans(newCust, 2)
fit
K-means clustering with 2 clusters of sizes 394, 46
Cluster means:
Fresh Milk Grocery Frozen Detergents_Paper
1 0.10783586 0.05593035 0.06023356 0.05026197 0.04094287
2 0.09962418 0.26868693 0.30354600 0.04850001 0.32375723
Delicassen
1 0.02718838
2 0.07077642
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
[31] 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1
[61] 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1
[91] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
[151] 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1
[181] 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2
[211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[241] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[271] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[301] 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
[331] 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1
[361] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[391] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
[421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
Within cluster sum of squares by cluster:
[1] 10.170431 7.074975
(between_SS / total_SS = 30.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
barplot(t(fit$centers), beside = TRUE, xlab="cluster", ylab="value")
This indicates that the customers in cluster 2 purchase Milk, Grocery, and Detergents_Paper products at a much higher rate than those in cluster 1.
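It can be hard to tell which bar belongs to which product, so here’s a sketch of the same bar plot with a legend added (only the presentation changes, not the values):
# Same grouped bar plot, with bar colors mapped to product names.
barplot(t(fit$centers), beside = TRUE, xlab = "cluster", ylab = "value",
        legend.text = colnames(fit$centers),
        args.legend = list(x = "topleft", cex = 0.7))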
Let’s check out what a cluster plot of this data looks like.
install.packages("cluster")   # skip if the package is already installed
library(cluster)
clusplot(newCust, fit$cluster, color = TRUE, shade = TRUE)
The two clusters above contain 394 and 46 observations, respectively. These clusters might not be optimal for drawing valuable insights, since most of the data is lumped into one cluster and the remaining scattered observations into the other. Let’s try some other clustering approaches and then adjust our clusters if necessary.
mds = cmdscale(dist(newCust), k = 2)
plot(mds, col = fit$cluster)
From this graph we see that the data is split into two distinct groups. Let’s see whether the Region and Channel variables can be predicted from our clusters.
table(customers$Channel, fit$cluster)
1 2
1 296 2
2 98 44
table(customers$Region, fit$cluster)
1 2
1 70 7
2 39 8
3 285 31
These tables suggest that the clusters created from the spending data are not strongly indicative of the region or the channel through which the money was spent.
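If we want a single number for how well the clusters line up with each categorical variable, the adjusted Rand index is one option (0 means no better than chance, 1 means identical labelings). A sketch, assuming the mclust package is available; it wasn’t part of the original analysis:
library(mclust)
# Agreement between the spending-based clusters and each categorical variable.
adjustedRandIndex(customers$Channel, fit$cluster)
adjustedRandIndex(customers$Region, fit$cluster)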
Let’s look at a correlation matrix to see how the variables are related in the original data set.
install.packages("Hmisc")
Error in install.packages : Updating loaded packages
library(Hmisc)
y <- rcorr(as.matrix(newCust))
y
Fresh Milk Grocery Frozen Detergents_Paper
Fresh 1.00 0.10 -0.01 0.35 -0.10
Milk 0.10 1.00 0.73 0.12 0.66
Grocery -0.01 0.73 1.00 -0.04 0.92
Frozen 0.35 0.12 -0.04 1.00 -0.13
Detergents_Paper -0.10 0.66 0.92 -0.13 1.00
Delicassen 0.24 0.41 0.21 0.39 0.07
Delicassen
Fresh 0.24
Milk 0.41
Grocery 0.21
Frozen 0.39
Detergents_Paper 0.07
Delicassen 1.00
n= 440
P
Fresh Milk Grocery Frozen Detergents_Paper
Fresh 0.0351 0.8042 0.0000 0.0325
Milk 0.0351 0.0000 0.0092 0.0000
Grocery 0.8042 0.0000 0.4003 0.0000
Frozen 0.0000 0.0092 0.4003 0.0057
Detergents_Paper 0.0325 0.0000 0.0000 0.0057
Delicassen 0.0000 0.0000 0.0000 0.0000 0.1468
Delicassen
Fresh 0.0000
Milk 0.0000
Grocery 0.0000
Frozen 0.0000
Detergents_Paper 0.1468
Delicassen
“Milk” and “Fresh” share a weak correlation. Let’s see how they compare in the clusters we created.
plot(newCust[c("Fresh", "Milk")], col = fit$cluster, cex = .5)
points(fit$centers[,c("Fresh", "Milk")], col=5, pch="X")
As shown in the graph above, based on the clusters we created, there is some overlap between customers who purchase a significant amount of milk and those who purchase a lot of fresh produce. However, as spending on either product increases, the relationship weakens; in particular, customers who buy lots of fresh produce tend not to be in the market for milk. The bright “X” marks the center of each cluster in this graph and in the subsequent graphs in this project.
plot(newCust[c("Detergents_Paper", "Grocery")], col = fit$cluster, cex = .5)
points(fit$centers[,c("Detergents_Paper", "Grocery")], col=5, pch="X")
In this graph, we see that the “Grocery” and “Detergents_Paper” variables are strongly correlated. It would be wise for the company to market both detergents/paper and grocery products to the customers shown in red, since they tend to spend money on both.
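For a quick confirmation of that relationship, here is a short sketch that tests the correlation and overlays a fitted line on the scatter plot above:
# Pearson correlation test for Detergents_Paper vs. Grocery.
cor.test(newCust$Detergents_Paper, newCust$Grocery)
# Dashed regression line drawn over the existing scatter plot.
abline(lm(Grocery ~ Detergents_Paper, data = newCust), lty = 2)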
Let’s look at the data again, but this time without normalizing before we create the clusters. This lets the products with the largest spending dominate the distance calculations.
customers <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"), header = TRUE, sep = ",")
customers$Region <- as.factor(customers$Region)
customers$Channel <- as.factor(customers$Channel)
newCust <- customers
newCust$Region <- NULL
newCust$Channel <- NULL
clus <- 2:12
set.seed(22)
WSS <- sapply(clus, function(k) {
kmeans(newCust, centers = k)$tot.withinss
})
WSS
[1] 113217528521 80332413843 67315940100 53206131318
[5] 47493122543 41922735726 36317892575 33601567798
[9] 31888233451 28781999756 27499697494
plot(clus, WSS, type = "l", xlab = "number of clusters k", ylab = "within sum of squares")
library(fpc)
sw = sapply(clus, function(k) {
cluster.stats(dist(newCust), kmeans(newCust, centers = k)$cluster)$avg.silwidth
})
sw
[1] 0.4508716 0.4783511 0.3975973 0.3711958 0.3185849 0.3226306
[7] 0.3195372 0.3061333 0.2799187 0.3056202 0.2643999
plot(clus, sw, type = "l", xlab = "number of clusters", ylab = "average silhouette width")
The graph above indicates that 3 clusters would be optimal.
clus[which.max(sw)]
[1] 3
Let’s create the three clusters.
set.seed(22)
fit = kmeans(newCust, 3)
fit
K-means clustering with 3 clusters of sizes 330, 50, 60
Cluster means:
Fresh Milk Grocery Frozen Detergents_Paper
1 8253.47 3824.603 5280.455 2572.661 1773.058
2 8000.04 18511.420 27573.900 1996.680 12407.360
3 35941.40 6044.450 6288.617 6713.967 1039.667
Delicassen
1 1137.497
2 2252.020
3 3049.467
Clustering vector:
[1] 1 1 1 1 3 1 1 1 1 2 1 1 3 1 3 1 1 1 1 1 1 1 3 2 3 1 1 1 2 3
[31] 1 1 1 3 1 1 3 1 2 3 3 1 1 2 1 2 2 2 1 2 1 1 3 1 3 1 2 1 1 1
[61] 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 3 1 3
[91] 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1
[121] 1 1 1 1 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 2 1 1 1 3
[151] 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1 1 3 1 1 1
[181] 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 2 3 1 1 2 1 1 1 2
[211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3
[241] 3 3 1 1 1 1 1 1 1 1 1 2 1 3 1 3 1 1 3 3 1 1 3 1 1 2 2 1 2 1
[271] 1 1 1 3 1 1 3 1 1 1 1 1 3 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1
[301] 1 2 1 1 2 1 2 1 1 2 1 3 2 1 1 1 1 1 1 2 1 1 1 1 3 3 1 1 1 1
[331] 1 2 1 2 1 3 1 1 1 1 1 1 1 2 1 1 1 3 1 2 1 2 1 2 1 1 1 1 1 1
[361] 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 3 1 3 1 2 1 1 1 1 1
[391] 1 1 1 3 1 1 1 1 1 1 1 3 3 3 1 1 3 2 1 1 1 1 1 1 1 1 1 1 2 1
[421] 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 3 3 2 1 1
Within cluster sum of squares by cluster:
[1] 28184318853 26382784678 25765310312
(between_SS / total_SS = 49.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
barplot(t(fit$centers), beside = TRUE, xlab="cluster", ylab="value")
As shown above, the vast majority of customers who spend heavily on fresh produce were grouped into cluster 3, so the company would be wise to market its fresh produce to that group. It might also benefit from marketing its grocery products to the customers in cluster 2.
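Because the raw spending levels differ so much by product, it can also help to express each cluster center relative to the overall average. This is just a sketch; ratios well above 1 mean the cluster spends more than average on that product.
# Divide each cluster center by the overall mean spending per product.
round(sweep(fit$centers, 2, colMeans(newCust), "/"), 2)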
library(cluster)
clusplot(newCust, fit$cluster, color = TRUE, shade = TRUE)
The clusters above contain 330, 50, and 60 observations for clusters 1, 2, and 3, respectively.
mds = cmdscale(dist(newCust), k = 2)
plot(mds, col = fit$cluster)
Let’s see if this clustering can predict region and channel. I sort of doubt that these variables are related.
table(customers$Channel, fit$cluster)
1 2 3
1 244 2 52
2 86 48 8
table(customers$Region, fit$cluster)
1 2 3
1 56 10 11
2 35 8 4
3 239 32 45
These tables also suggest that the clusters created from the spending data are not strongly indicative of the region or the channel through which the money was spent.
Looking again at the correlation matrix, let’s pick out a couple of variable pairs and see how they fall out across the clusters.
library(Hmisc)
y <- rcorr(as.matrix(newCust))
y
Fresh Milk Grocery Frozen Detergents_Paper
Fresh 1.00 0.10 -0.01 0.35 -0.10
Milk 0.10 1.00 0.73 0.12 0.66
Grocery -0.01 0.73 1.00 -0.04 0.92
Frozen 0.35 0.12 -0.04 1.00 -0.13
Detergents_Paper -0.10 0.66 0.92 -0.13 1.00
Delicassen 0.24 0.41 0.21 0.39 0.07
Delicassen
Fresh 0.24
Milk 0.41
Grocery 0.21
Frozen 0.39
Detergents_Paper 0.07
Delicassen 1.00
n= 440
P
Fresh Milk Grocery Frozen Detergents_Paper
Fresh 0.0351 0.8042 0.0000 0.0325
Milk 0.0351 0.0000 0.0092 0.0000
Grocery 0.8042 0.0000 0.4003 0.0000
Frozen 0.0000 0.0092 0.4003 0.0057
Detergents_Paper 0.0325 0.0000 0.0000 0.0057
Delicassen 0.0000 0.0000 0.0000 0.0000 0.1468
Delicassen
Fresh 0.0000
Milk 0.0000
Grocery 0.0000
Frozen 0.0000
Detergents_Paper 0.1468
Delicassen
First, let’s plot “Detergents_Paper” against “Milk”.
plot(newCust[c("Detergents_Paper", "Milk")], col = fit$cluster, cex = .5)
points(fit$centers[,c("Detergents_Paper", "Milk")], col=5, pch="X")
This graph doesn’t seem to lead to much valuable insight.
plot(newCust[c("Frozen", "Grocery")], col = fit$cluster, cex = .5)
points(fit$centers[,c("Frozen", "Grocery")], col=5, pch="X")
It looks as though the people in the red cluster tend to buy groceries at a much higher rate than frozen foods.
While there are many more explorations we could perform on our clusters, let’s shift gears to HCA clustering.
HCA CLUSTERING
First we load the data.
data <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"), header = TRUE, sep = ",")
Let’s again remove the region and channel variables and we’ll need to normalize the data.
data$Channel <- NULL
data$Region <-NULL
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
data[1:6] <- as.data.frame(lapply(data[1:6], normalize))
We’ll use packages “NbClust” and “factoextra” to choose the optimal number of clusters for our HCA analysis.
install.packages("NbClust")
Error in install.packages : Updating loaded packages
install.packages("factoextra")
Error in install.packages : Updating loaded packages
library(NbClust)
library(factoextra)
For the first analysis we’ll use ward.D2 linkage.
nb <- NbClust(data, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "ward.D2")
NaNs produced
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 2 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 3 proposed 6 as the best number of clusters
* 4 proposed 7 as the best number of clusters
* 3 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
fviz_nbclust(nb)
Among all indices:
===================
* 2 proposed 0 as the best number of clusters
* 1 proposed 1 as the best number of clusters
* 2 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 3 proposed 6 as the best number of clusters
* 4 proposed 7 as the best number of clusters
* 3 proposed 10 as the best number of clusters
Conclusion
=========================
* According to the majority rule, the best number of clusters is 3 .
As shown above, the optimal number of clusters is 3.
hc = hclust(dist(data, method="euclidean"), method="ward.D2")
hc
Call:
hclust(d = dist(data, method = "euclidean"), method = "ward.D2")
Cluster method : ward.D2
Distance : euclidean
Number of objects: 440
plot(hc, hang = -0.01, cex = 0.7)
We’ll now cut the dendrogram into three clusters.
fit <- cutree(hc, k = 3)
table(fit)
fit
1 2 3
310 125 5
As seen above, 310 of the observations are contained in cluster 1, 125 in cluster 2, and 5 in cluster 3.
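To get a feel for what each of these three clusters actually buys, here is a short sketch that averages the normalized spending by cluster (output not shown):
# Mean normalized spending per hierarchical cluster.
aggregate(data, by = list(cluster = fit), FUN = mean)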
plot(hc)
rect.hclust(hc, k = 3, border = "red")
The clusters are shown above.
Let’s also create clusters with “single” linkage.
nb2 <- NbClust(data, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "single")
NaNs produced
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 3 proposed 7 as the best number of clusters
* 5 proposed 8 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
fviz_nbclust(nb2)
Among all indices:
===================
* 2 proposed 0 as the best number of clusters
* 1 proposed 1 as the best number of clusters
* 8 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 4 proposed 5 as the best number of clusters
* 3 proposed 7 as the best number of clusters
* 5 proposed 8 as the best number of clusters
Conclusion
=========================
* According to the majority rule, the best number of clusters is 2 .
2 clusters is optimal for this method.
hc2 <- hclust(dist(data), method = "single")
plot(hc2, hang = -.01, cex = .7)
hc2
Call:
hclust(d = dist(data), method = "single")
Cluster method : single
Distance : euclidean
Number of objects: 440
fit2 <- cutree(hc2, k = 2)
table(fit2)
fit2
1 2
439 1
As shown above, single linkage put 439 of the 440 observations into one cluster (a chaining effect that single linkage is prone to), so it didn’t produce meaningful groups we can work with. Let’s analyze our ward.D2 linkage clusters instead.
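For reference, other linkages can be compared the same way; here is a sketch (not part of the original analysis) using complete linkage, which is generally less prone to chaining:
hc_complete <- hclust(dist(data), method = "complete")
# Cluster sizes from a three-cluster cut of the complete-linkage tree.
table(cutree(hc_complete, k = 3))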
plot(hc)
rect.hclust(hc, k = 4, border = "red")
fit5 <- cutree(hc, h = 3)
table(fit5)
fit5
1 2
310 130
plot(hc)
rect.hclust(hc, h = 3, border = "red")
fit5 <- cutree(hc, h = .5)
table(fit5)
fit5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
39 48 25 19 53 47 76 32 1 19 16 36 2 5 2 1 1 7 3 1 1 4
23 24
1 1
plot(hc)
rect.hclust(hc, h = .5, border = "red")
Regardless of where we choose to cut the dendrogram into clusters, there are simply too many observations for the plot to stay legible: with 440 customers each labeled only by a row number, the dendrogram itself doesn’t convey much. HCA would likely have more value on a smaller data set whose observations are distinguished by name. I’m also aware that my knowledge of how to build and edit dendrograms is limited, so perhaps there is a way to obtain more value from this data set using HCA.
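That said, one partial remedy for the readability problem is to draw the dendrogram without the 440 individual labels and simply mark the chosen cut; a small sketch:
# Dendrogram with leaf labels suppressed, plus boxes around the k = 3 cut.
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 3, border = "red")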