Market segmentation is the process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers (known as segments) based on some type of shared characteristics. In dividing or segmenting markets, researchers typically look for common characteristics such as shared needs, common interests, similar lifestyles or even similar demographic profiles. The overall aim of segmentation is to identify high yield segments – that is, those segments that are likely to be the most profitable or that have growth potential – so that these can be selected for special attention (i.e. become target markets). [Source: Wikipedia]
First, a brief introduction to the K-Means algorithm, which I am going to implement here. The following steps describe how the method works:
1. k initial “means” are randomly generated within the data domain.
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
[Source: Wikipedia]
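To make these four steps concrete, here is a minimal from-scratch sketch in R. It is illustrative only: the function name simple_kmeans and its arguments are my own, it uses Forgy initialization (sampling k observations) rather than generating arbitrary points in the data domain, and it assumes no cluster ever becomes empty. The actual analysis below uses R's built-in kmeans() instead.

# Minimal from-scratch K-Means sketch (illustrative only)
simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  # Step 1: choose k initial means at random from the data (Forgy initialization)
  means <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max.iter)) {
    # Step 2: assign each observation to the nearest mean (squared Euclidean distance)
    cluster <- apply(X, 1, function(x) which.min(colSums((t(means) - x)^2)))
    # Step 3: the centroid of each cluster becomes the new mean
    new.means <- t(sapply(seq_len(k),
                          function(j) colMeans(X[cluster == j, , drop = FALSE])))
    # Step 4: repeat until the means stop moving (convergence)
    if (all(abs(new.means - means) < 1e-8)) break
    means <- new.means
  }
  list(cluster = cluster, centers = means)
}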
In this blog I am going to apply K-Means clustering to a dataset from the trusted UCI Machine Learning Repository. The Wholesale Customers dataset consists of 8 attributes and 440 instances. It can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv.
Attribute Information:
                   Minimum   Maximum      Mean   Std. Deviation
FRESH                    3    112151  12000.30        12647.329
MILK                    55     73498   5796.27         7380.377
GROCERY                  3     92780   7951.28         9503.163
FROZEN                  25     60869   3071.93         4854.673
DETERGENTS_PAPER         3     40827   2881.49         4767.854
DELICATESSEN             3     47943   1524.87         2820.106

REGION          Frequency
Lisbon                 77
Oporto                 47
Other Region          316
Total                 440

CHANNEL         Frequency
Horeca                298
Retail                142
Total                 440
Now let us jump straight into the code.
# Load the required libraries
library(corrplot)
## corrplot 0.84 loaded
# Read the input file from the UCI repository
wholecust <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"))
Now, let’s understand the dataset by taking a peek at the data:
head(wholecust)
We can also get a useful summary of the dataset:
summary(wholecust)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
Let us now explore the dataset further and identify strong correlations among the variables.
# Identify strong correlations among the variables
w <- cor(wholecust)
corrplot(w, method="number")
From the above plot we can see that there is a strong correlation between Detergents_Paper and Grocery.
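As a quick sanity check, we can also pull the strongest off-diagonal correlation out of the matrix programmatically. A small sketch using the w matrix computed above (w2 and idx are my own temporary names):

# Locate the strongest correlation between two distinct variables
w2 <- w
diag(w2) <- 0                     # ignore the trivial self-correlations
idx <- which(abs(w2) == max(abs(w2)), arr.ind = TRUE)[1, ]
c(rownames(w2)[idx[1]], colnames(w2)[idx[2]], round(w2[idx[1], idx[2]], 3))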
Now we need to find the optimal number of clusters k. Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. [Source: Wikipedia]
# Elbow method: total within-cluster sum of squares for k = 1..15
set.seed(123)
k.max <- 15
# Scale the six spending variables before clustering
data <- as.matrix(scale(wholecust[, 3:8]))
wss <- sapply(1:k.max,
              function(k){kmeans(data, k, nstart = 50, iter.max = 15)$tot.withinss})
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
From the above plot we can see that 3 or 5 is a reasonable number of clusters: beyond these values the curve flattens, so each additional cluster reduces the within-cluster sum of squares only marginally.
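To make that flattening concrete, we can print the percentage drop in wss achieved by each additional cluster, using the wss vector computed above; the drops should shrink sharply past the elbow:

# Percentage decrease in total within-cluster SS gained by each extra cluster
round(100 * -diff(wss) / head(wss, -1), 1)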
Let us run K-Means and inspect the clusters for both scenarios. Note that, unlike the elbow computation above, these fits use the raw (unscaled) spending values, which is what the printed cluster means below reflect.
# Fit K-Means with k = 3 on the raw spending columns
kmm <- kmeans(wholecust[,(3:8)], 3)
kmm
## K-means clustering with 3 clusters of sizes 330, 60, 50
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
## 2 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
## 3 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
##
## Clustering vector:
## [1] 1 1 1 1 2 1 1 1 1 3 1 1 2 1 2 1 1 1 1 1 1 1 2 3 2 1 1 1 3 2 1 1 1 2 1 1 2
## [38] 1 3 2 2 1 1 3 1 3 3 3 1 3 1 1 2 1 2 1 3 1 1 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1
## [75] 1 1 1 3 1 1 1 1 1 1 1 3 3 2 1 2 1 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1
## [112] 3 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 3 1 1
## [149] 1 2 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 3 1 3 1 1 2 1 1 1 1 2 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 3 3 2 1 1 3 1 1 1 3 1 3 1 1 1 1 3 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 3 1 2 1 2 1 1 2
## [260] 2 1 1 2 1 1 3 3 1 3 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1 2 1 1 1 1 1 1
## [297] 1 1 1 1 1 3 1 1 3 1 3 1 1 3 1 2 3 1 1 1 1 1 1 3 1 1 1 1 2 2 1 1 1 1 1 3 1
## [334] 3 1 2 1 1 1 1 1 1 1 3 1 1 1 2 1 3 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 2 1 1 1 1 1 1 2 1 1 2 1 2 1 3 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2
## [408] 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 28184318853 25765310312 26382784678
## (between_SS / total_SS = 49.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Now let’s run K-Means for k = 5:
# Fit K-Means with k = 5 on the same raw spending columns
kmm <- kmeans(wholecust[,(3:8)], 5)
kmm
## K-means clustering with 5 clusters of sizes 104, 10, 223, 80, 23
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 21200.058 3886.423 5138.933 4119.856 1131.5192 1690.3365
## 2 21263.700 37443.300 46710.600 6287.200 21699.4000 8743.3000
## 3 6052.812 3266.314 4092.054 2459.682 1214.1300 990.6099
## 4 4738.762 11609.013 18198.775 1515.388 8003.7750 1603.8000
## 5 49296.087 4983.783 5590.304 8285.783 962.2609 2543.6957
##
## Clustering vector:
## [1] 3 3 3 3 1 3 3 3 3 4 4 3 1 1 1 3 4 3 1 3 1 3 1 2 1 1 3 1 4 5 1 3 1 1 3 3 1
## [38] 4 4 5 1 1 4 4 3 4 4 2 3 4 3 3 5 4 1 3 4 4 3 3 3 2 3 4 3 2 3 1 3 3 1 1 3 1
## [75] 3 1 3 4 3 3 3 4 4 1 3 2 2 5 3 1 3 3 2 1 4 3 3 3 3 3 4 4 3 5 1 1 4 4 3 4 3
## [112] 4 1 1 1 3 3 3 1 3 1 3 3 3 5 5 1 1 3 5 3 3 1 3 3 3 3 3 3 3 1 1 5 3 1 4 3 3
## [149] 3 1 1 3 1 3 3 4 4 1 3 4 3 3 1 4 3 4 3 3 3 3 4 4 3 4 3 4 5 3 3 3 3 5 4 2 3
## [186] 3 3 3 4 4 1 3 3 4 3 1 1 3 3 3 4 4 1 3 3 4 3 3 3 4 1 2 3 3 4 4 4 1 4 3 1 3
## [223] 3 3 3 3 1 3 3 3 3 3 1 3 1 3 3 1 3 5 1 1 1 3 3 4 3 3 1 3 3 4 3 1 3 1 3 3 5
## [260] 5 3 3 1 3 4 4 4 1 4 1 3 3 3 5 3 3 1 3 3 1 3 3 5 1 5 5 3 1 1 5 3 3 3 4 1 3
## [297] 1 3 3 3 1 4 3 4 4 4 4 1 3 4 3 1 4 3 3 4 3 3 3 4 3 3 1 3 1 5 3 3 1 3 3 4 1
## [334] 2 1 1 3 3 3 3 3 3 3 4 3 3 4 1 3 4 3 4 3 4 1 3 1 4 3 3 1 3 3 3 3 3 3 3 1 3
## [371] 5 1 3 1 3 3 4 5 3 3 1 1 1 3 4 3 3 1 3 3 3 3 3 1 3 3 4 3 3 3 3 1 1 1 1 3 1
## [408] 4 3 3 3 3 3 3 3 3 4 3 4 3 4 1 1 1 1 3 4 1 3 3 4 3 1 3 1 1 5 4 3 3
##
## Within cluster sum of squares by cluster:
## [1] 8521349738 14108802241 10060038988 8835879467 11679101316
## (between_SS / total_SS = 66.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Let us look closely at the cluster means for both scenarios (a small programmatic check of these labels follows the lists):
Scenario 1: k = 3
Cluster 1 - low spenders in every category (the largest segment, 330 customers).
Cluster 2 - highest Fresh spenders.
Cluster 3 - highest Milk, Grocery and Detergents_Paper spenders.
Scenario 2: k = 5
Cluster 1 - moderately high Fresh spenders.
Cluster 2 - highest Milk, Grocery and Detergents_Paper spenders (only 10 customers).
Cluster 3 - low spenders (the largest segment, 223 customers).
Cluster 4 - moderate Milk, Grocery and Detergents_Paper spenders.
Cluster 5 - highest Fresh spenders.
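These labels can be checked less subjectively by comparing each cluster’s means with the overall column means. A small sketch, using the k = 5 fit stored in kmm above:

# Ratio of each cluster mean to the overall column mean; values well above 1
# show which categories a segment over-spends on
round(sweep(kmm$centers, 2, colMeans(wholecust[, 3:8]), "/"), 2)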
From the above two analyses, 3 clusters prove to be a good working number for quickly understanding the customer segmentation; the k = 5 solution largely splits the same broad segments into finer variants.
A possible extension to this project is to use Principal Component Analysis (PCA) for dimensionality reduction before clustering.
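As a rough sketch of that extension (the names pca and km3 are my own), one could project the scaled spending variables onto their first two principal components and colour the points by a k = 3 clustering:

# PCA on the scaled spending variables, with the k = 3 clusters
# projected onto the first two principal components
pca <- prcomp(wholecust[, 3:8], scale. = TRUE)
summary(pca)   # proportion of variance explained per component
km3 <- kmeans(scale(wholecust[, 3:8]), centers = 3, nstart = 50)
plot(pca$x[, 1:2], col = km3$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Wholesale customers on the first two principal components")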