Mayuri Ingle
01/15/2019
In this project, we will analyze a dataset from the UCI Machine Learning Repository that describes the clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
The goal of the project is to separate, or group, customers based on their spending patterns. This will help the distributor understand how customer demand varies with the group to which each customer belongs.
Let’s get started…
# Import all required libraries
library(tidyverse)
## -- Attaching packages ------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(cluster)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Loading data set
wsCustomer <- read_csv("F:\\RFiles\\Datasets\\Wholesale customers data.csv")
## Parsed with column specification:
## cols(
## Channel = col_double(),
## Region = col_double(),
## Fresh = col_double(),
## Milk = col_double(),
## Grocery = col_double(),
## Frozen = col_double(),
## Detergents_Paper = col_double(),
## Delicassen = col_double()
## )
# View first 6 rows of dataset (the default for head())
head(wsCustomer)
## # A tibble: 6 x 8
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
The dataset has 8 columns. The first 2 columns hold the Channel and Region, and the remaining 6 columns hold the annual spending per product category.
In this section we will explore our data to understand the features and their relation to each other.
The dataset has six product categories: ‘Fresh’, ‘Milk’, ‘Grocery’, ‘Frozen’, ‘Detergents_Paper’, and ‘Delicassen’ (the dataset’s spelling of delicatessen). Let’s view the statistical summary for each of these categories.
# Summary of product categories
summary(wsCustomer[3:8])
## Fresh Milk Grocery Frozen
## Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
The summary shows a large difference between the minimum and maximum spending of customers. This gives an initial hint that the distributor has both low-spending and high-spending clients.
Let’s quickly check for missing data.
sum(is.na(wsCustomer))
## [1] 0
There is no missing data in our dataset.
Next we preprocess the data by replacing the Region and Channel codes with their names, and then scaling the data.
As per the data description, the Channel and Region columns are as follows:
Channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel
Frequency: Horeca 298, Retail 142
Region - Lisbon, Oporto or Other
Frequency: Lisbon 77, Oporto 47, Other Region 316
We will replace the values in these columns with the respective names so that we can later analyse which Regions and Channels fall in each cluster.
#Converting Region and Channel columns ; replacing values by names
wsCustomer <- wsCustomer %>% mutate(Channel = ifelse(Channel == 1 , "HoReCa","Retail"),
Region = case_when(Region == 1 ~ "Lisbon",
Region == 2 ~ "Oporto",
Region == 3 ~ "Others"))
head(wsCustomer)
## # A tibble: 6 x 8
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Retail Others 12669 9656 7561 214 2674 1338
## 2 Retail Others 7057 9810 9568 1762 3293 1776
## 3 Retail Others 6353 8808 7684 2405 3516 7844
## 4 HoReCa Others 13265 1196 4221 6404 507 1788
## 5 Retail Others 22615 5410 7198 3915 1777 5185
## 6 Retail Others 9413 8259 5126 666 1795 1451
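As a cross-check on the recoding, the same mapping can be written with factor(), which turns any code outside the documented set into NA rather than silently labelling it "Retail" the way the ifelse() call would. This is only a sketch on made-up codes: the vectors channel_code and region_code are hypothetical and are not the real columns.

```r
# Hypothetical codes for illustration; 9 is a deliberately invalid value.
channel_code <- c(1, 2, 2, 1, 9)
region_code  <- c(3, 1, 2, 3, 3)

# factor() maps each documented code to its label; anything else becomes NA.
channel <- factor(channel_code, levels = c(1, 2),
                  labels = c("HoReCa", "Retail"))
region  <- factor(region_code, levels = c(1, 2, 3),
                  labels = c("Lisbon", "Oporto", "Others"))

print(channel)
print(table(region, useNA = "ifany"))
```

An unexpected NA count after such a recode would be a cheap signal that the raw file contains codes the data description does not mention.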
As we don’t want the clustering algorithm to depend on arbitrary variable units, we start by scaling/standardizing the data using the R function scale:
wsCustomerSc <- as_tibble(scale(wsCustomer[3:8]))
head(wsCustomerSc)
## # A tibble: 6 x 6
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0529 0.523 -0.0411 -0.589 -0.0435 -0.0663
## 2 -0.391 0.544 0.170 -0.270 0.0863 0.0890
## 3 -0.447 0.408 -0.0281 -0.137 0.133 2.24
## 4 0.1000 -0.623 -0.393 0.686 -0.498 0.0933
## 5 0.839 -0.0523 -0.0793 0.174 -0.232 1.30
## 6 -0.205 0.334 -0.297 -0.496 -0.228 -0.0262
As of now the Channel and Region columns are excluded as they do not refer to spending information.
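To make the standardization step concrete, here is a small sketch of what scale() computes by default: each column has its mean subtracted and is then divided by its standard deviation. The matrix x below holds made-up numbers and only borrows two column names for readability.

```r
# Toy spending matrix (made-up numbers) with two columns on different scales.
x <- cbind(Fresh = c(100, 200, 300, 400), Milk = c(1, 3, 5, 7))

scaled <- scale(x)

# The same result by hand: (value - column mean) / column standard deviation.
by_hand <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")

print(round(colMeans(scaled), 10))  # column means after scaling
print(apply(scaled, 2, sd))         # column standard deviations after scaling
print(all.equal(as.numeric(scaled), as.numeric(by_hand)))
```

After scaling, every column has mean 0 and standard deviation 1, so no single product category dominates the Euclidean distances that k-means relies on.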
In this section, we use the K-Means clustering algorithm to identify the various customer segments hidden in the data. We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.
K-Means Clustering:
K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a data set into k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. It assigns objects to clusters such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity).
First we need to find the optimal number of clusters, as we have to specify k before clustering.
The three most popular methods for determining the optimal number of clusters are:
Elbow method
Silhouette method
Gap statistic
In k-means, the total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible.
We can therefore use the following procedure to find the optimal number of clusters with the elbow method:
1) Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2) For each k, calculate the total within-cluster sum of squares (wss).
3) Plot the curve of wss against the number of clusters k.
4) The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
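The steps above can be sketched by hand in a few lines of base R. This is only an illustration: the matrix x is synthetic data made of three blobs, standing in for the scaled spending table, and the choices nstart = 10 and k = 1..10 are assumptions.

```r
set.seed(123)
# Synthetic data standing in for the scaled spending table: three 2-D blobs.
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
           cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5)))

# Steps 1 and 2: run k-means for k = 1..10 and keep the total within-cluster SS.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Step 3: plot wss against k; step 4 is reading the bend off this curve.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the wss is simply the total sum of squares of the data, which gives the curve its starting point.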
Fortunately, this process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):
set.seed(123)
fviz_nbclust(wsCustomerSc , kmeans, method = "wss")
The elbow plot does not show a sharp bend in this case, but we can consider k = 5.
The silhouette coefficient of a data point measures how well it fits its assigned cluster compared with the nearest other cluster, on a scale from -1 (poor fit) to 1 (good fit). Averaging the silhouette coefficients gives a simple score for a given clustering.
A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of the observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
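A minimal sketch of the average silhouette computation, using silhouette() from the cluster package on synthetic data. The two-blob matrix x and the range k = 2..6 are assumptions chosen for illustration, not the wholesale data.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic two-blob data (an assumption for illustration only).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))
d <- dist(x)

# Average silhouette width for k = 2..6 (it is undefined for a single cluster).
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The suggested k is the one with the largest average silhouette width.
best_k <- k_values[which.max(avg_sil)]
print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
print(best_k)
```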
As with the elbow method, the “average silhouette method” has been wrapped up in a single function (fviz_nbclust):
set.seed(123)
fviz_nbclust(wsCustomerSc, kmeans, method = "silhouette")
The silhouette method suggests that the optimal number of clusters is 2.
The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value of k that maximizes the gap statistic, meaning that the clustering structure is furthest from a random uniform distribution of points.
To compute the gap statistic we can use the clusGap function, which returns the gap statistic and its standard error for each k. We can visualize the results with fviz_gap_stat, which also marks the suggested number of clusters.
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(wsCustomerSc, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat)
The gap statistic method suggests that the optimal number of clusters is 3.
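To see how a value of k is read off the gap curve, the sketch below inspects the table returned by clusGap and applies maxSE from the cluster package with two different rules. The data are synthetic (three blobs), and K.max = 6 and B = 20 are assumptions that keep the run short. To my understanding, fviz_gap_stat applies the "firstSEmax" rule by default, which can pick a smaller k than the global maximum of the gap.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(123)
# Synthetic three-blob data (an assumption for illustration only).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
           cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5)))

gs  <- clusGap(x, FUN = kmeans, nstart = 10, K.max = 6, B = 20)
tab <- gs$Tab  # columns: logW, E.logW, gap, SE.sim

# The gap is the expected log(W_k) under the reference minus the observed one.
print(all.equal(unname(tab[, "gap"]),
                unname(tab[, "E.logW"] - tab[, "logW"])))

# Two ways of reading k off the curve: the global maximum of the gap, and
# "firstSEmax", the smallest k whose gap is within one standard error
# of the first local maximum.
k_global  <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "globalmax")
k_firstSE <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
print(c(globalmax = k_global, firstSEmax = k_firstSE))
```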
With the above estimates for k, we will compute k-means clustering for k = 2, 3 and 5. We can view the results with fviz_cluster, which gives a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points on the first two principal components, which capture the largest share of the variance.
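The PCA projection behind such a plot can be approximated in base R with prcomp. The sketch below uses synthetic four-variable data in place of the six scaled spending columns, and it also prints how much variance the first two components carry, which is worth checking before trusting a 2-D picture of the clusters.

```r
set.seed(123)
# Synthetic 4-variable data standing in for the six scaled spending columns.
x  <- scale(matrix(rnorm(200 * 4), ncol = 4) %*% matrix(runif(16), 4, 4))
km <- kmeans(x, centers = 3, nstart = 10)

# Principal components of the (already scaled) data.
pca <- prcomp(x)

# Proportion of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 3))

# Scatter of the first two components, coloured by cluster.
plot(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```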
# Compute Kmeans for K = 2 ,3 and 5
set.seed(123)
k2 <- kmeans(wsCustomerSc, centers = 2, nstart = 30)
k3 <- kmeans(wsCustomerSc, centers = 3, nstart = 30)
k5 <- kmeans(wsCustomerSc, centers = 5, nstart = 30)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = wsCustomerSc) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = wsCustomerSc) + ggtitle("k = 3")
p3 <- fviz_cluster(k5, geom = "point", data = wsCustomerSc) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, nrow = 2)
The visual assessment above shows that 2 and 3 clusters separate the data into distinct groups, in line with the values suggested by the silhouette and gap statistic methods.
Let’s have a look at the cluster details for these two cases.
print(k2)
## K-means clustering with 2 clusters of sizes 399, 41
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper
## 1 -0.00542930 -0.2122882 -0.2302493 -0.03310806 -0.2320799
## 2 0.05283636 2.0659269 2.2407190 0.32219794 2.2585338
## Delicassen
## 1 -0.0826124
## 2 0.8039597
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1
## [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
## [176] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2
## [211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1
## [316] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2
## [351] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [386] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 982.9619 966.3860
## (between_SS / total_SS = 26.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
print(k3)
## K-means clustering with 3 clusters of sizes 44, 3, 393
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper
## 1 -0.306069120 1.8098555 2.2038591 -0.24975466 2.2309315
## 2 3.840444845 3.2957757 0.9852919 7.20489292 -0.1527927
## 3 0.004950908 -0.2277887 -0.2542638 -0.02703683 -0.2486071
## Delicassen
## 1 0.25139852
## 2 6.79967230
## 3 -0.08005229
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 3 3 3
## [36] 3 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 3
## [71] 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3
## [176] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3 3 3 1
## [211] 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 1 3 3 1 3 3
## [316] 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1
## [351] 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [386] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3
##
## Within cluster sum of squares by cluster:
## [1] 441.0021 214.5396 944.8291
## (between_SS / total_SS = 39.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
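The between_SS / total_SS figure in the printouts is the share of the total variation that the cluster assignment accounts for. A small sketch of where it comes from, using the components of a kmeans fit on synthetic two-blob data (an assumption for illustration):

```r
set.seed(123)
# Synthetic two-blob data (an assumption for illustration only).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# The total SS splits into a within-cluster and a between-cluster part.
print(all.equal(km$totss, km$tot.withinss + km$betweenss))

# Share of the total variation explained by the cluster assignment, in percent.
ratio <- km$betweenss / km$totss
print(round(100 * ratio, 1))
```

The ratio never decreases for the best achievable fit as k grows, so it is more useful for comparing fits with the same k than for choosing k on its own.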
From the details above we can conclude that three clusters will be suitable for us, as k = 3 places the high-variation observations in a separate group. This cluster may contain potential high-spending customers.
Thus we will carry out our final analysis with 3 clusters.
# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(wsCustomerSc, 3, nstart = 30)
print(final)
## K-means clustering with 3 clusters of sizes 44, 3, 393
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper
## 1 -0.306069120 1.8098555 2.2038591 -0.24975466 2.2309315
## 2 3.840444845 3.2957757 0.9852919 7.20489292 -0.1527927
## 3 0.004950908 -0.2277887 -0.2542638 -0.02703683 -0.2486071
## Delicassen
## 1 0.25139852
## 2 6.79967230
## 3 -0.08005229
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 3 3 3
## [36] 3 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 3
## [71] 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3
## [176] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3 3 3 1
## [211] 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 1 3 3 1 3 3
## [316] 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1
## [351] 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [386] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3
##
## Within cluster sum of squares by cluster:
## [1] 441.0021 214.5396 944.8291
## (between_SS / total_SS = 39.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
#Visualize final clusters
fviz_cluster(final, geom = "point", data = wsCustomerSc)
We can now extract the cluster assignments and add them to the initial data to compute some descriptive statistics at the cluster level:
# Calculate cluster level means
ClustMeans <- wsCustomer[3:8] %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
ClustMeans
## # A tibble: 3 x 7
## Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 8129. 19154. 28895. 1859. 13518. 2234.
## 2 2 60572. 30120. 17315. 38049. 2153 20701.
## 3 3 12063. 4115. 5535. 2941. 1696. 1299.
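An equivalent way to get cluster means in the original units is to undo the standardization on the k-means centers, using the centering and scaling values that scale() stores as attributes. The sketch below checks this on made-up spending data; the column names and numbers are illustrative only.

```r
set.seed(123)
# Made-up raw spending in two categories (illustrative names and values).
raw <- cbind(Fresh = c(rnorm(30, 12000, 1500), rnorm(30, 3000, 500)),
             Milk  = c(rnorm(30, 2000, 300),  rnorm(30, 9000, 1000)))
scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Undo the standardization: multiply by the column sd, then add the column mean.
mu <- attr(scaled, "scaled:center")
s  <- attr(scaled, "scaled:scale")
centers_raw <- sweep(sweep(km$centers, 2, s, "*"), 2, mu, "+")

# Cluster means computed directly on the raw data, for comparison.
means_raw <- apply(raw, 2, function(col) tapply(col, km$cluster, mean))

print(round(centers_raw))
print(all.equal(unname(centers_raw), unname(means_raw)))
```

Both routes agree because each k-means center is the mean of the scaled points in its cluster, and the standardization is a linear transformation applied column by column.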
# Visualize customer segments with average product spending
ClustMeans %>%
gather(Product, MU, Fresh:Delicassen)%>%
ggplot(aes(x=Product , y = MU, fill = Product)) + geom_col(width = 0.5) +
facet_grid(.~ Cluster)+
scale_fill_brewer(palette = "Accent")+
ylab("Customer Spending in Monetary Units") +
ggtitle("Customer Segments with Average Product Spending")+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank())
The statistics above and their visualization show that customer spending habits vary between segments. We will summarize them in the last section.
Now let’s analyse the Channel and Region distribution in each cluster.
# Channel and Region in Cluster 1/Segment 1
wsCustomer[1:2] %>%
mutate(Cluster = final$cluster) %>%
filter(Cluster == 1) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 3 x 3
## Channel Region n
## <chr> <chr> <int>
## 1 Retail Lisbon 7
## 2 Retail Oporto 8
## 3 Retail Others 29
# Channel and Region in Cluster 2 /Segment 2
wsCustomer[1:2] %>%
mutate(Cluster = final$cluster) %>%
filter(Cluster == 2) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 2 x 3
## Channel Region n
## <chr> <chr> <int>
## 1 HoReCa Oporto 1
## 2 HoReCa Others 2
# Channel and Region in Cluster 3/Segment 3
wsCustomer[1:2] %>%
mutate(Cluster = final$cluster) %>%
filter(Cluster == 3) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 6 x 3
## Channel Region n
## <chr> <chr> <int>
## 1 HoReCa Lisbon 59
## 2 HoReCa Oporto 27
## 3 HoReCa Others 209
## 4 Retail Lisbon 11
## 5 Retail Oporto 11
## 6 Retail Others 76
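The three filtered counts above can also be produced in a single pass by adding Cluster as a grouping key to count(). This is a sketch on toy data: the tibble customers and the vector cluster_id are hypothetical stand-ins for wsCustomer[1:2] and final$cluster.

```r
library(dplyr)

# Toy stand-ins for wsCustomer[1:2] and final$cluster (values are made up).
customers <- tibble(
  Channel = c("Retail", "Retail", "HoReCa", "HoReCa", "HoReCa", "HoReCa", "Retail"),
  Region  = c("Lisbon", "Others", "Others", "Oporto", "Others", "Others", "Others")
)
cluster_id <- c(1, 1, 2, 3, 3, 3, 3)

# One count() over all three keys replaces the three filter() calls.
segment_counts <- customers %>%
  mutate(Cluster = cluster_id) %>%
  count(Cluster, Channel, Region)

print(segment_counts)
```

Keeping all segments in one table makes it easier to compare the Channel and Region mix across clusters side by side.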
From the analysis above we can draw the following conclusions about the spending habits of each customer segment identified in the clustered data:
Segment 1: This segment contains only Retail customers, who spend mainly on Groceries and Milk, followed by Detergents_Paper and then Fresh products.
Segment 2: This segment contains only a few Hotel/Restaurant/Cafe customers (3 in total), who spend heavily on Fresh and Frozen, followed by Milk and Groceries. They also have the highest mean spending on Delicassen. These customers form a well-separated group with these spending habits.
Segment 3: This segment consists mostly of Hotel/Restaurant/Cafe customers, along with some Retail customers, who spend moderately on Fresh, followed by Groceries and Milk, but spend the least of all groups on Detergents_Paper and Delicassen.
This concludes our customer segmentation exercise using the K-Means algorithm.