Unsupervised Learning Project: Customer Segmentation

Mayuri Ingle
01/15/2019

Introduction :

In this project, we will analyze a dataset from the UCI Machine Learning Repository that describes the clients of a wholesale distributor. It includes their annual spending in monetary units (m.u.) on diverse product categories.

The goal of the project is to group customers based on their spending patterns. This will help the distributor understand how customer demand varies from one group to another.

Let’s get started.

# Import all required libraries

library(tidyverse)
## -- Attaching packages ------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(cluster)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## Loading data set

wsCustomer <- read_csv("F:\\RFiles\\Datasets\\Wholesale customers data.csv")
## Parsed with column specification:
## cols(
##   Channel = col_double(),
##   Region = col_double(),
##   Fresh = col_double(),
##   Milk = col_double(),
##   Grocery = col_double(),
##   Frozen = col_double(),
##   Detergents_Paper = col_double(),
##   Delicassen = col_double()
## )
# View the first 6 rows of the dataset
head(wsCustomer)
## # A tibble: 6 x 8
##   Channel Region Fresh  Milk Grocery Frozen Detergents_Paper Delicassen
##     <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1       2      3 12669  9656    7561    214             2674       1338
## 2       2      3  7057  9810    9568   1762             3293       1776
## 3       2      3  6353  8808    7684   2405             3516       7844
## 4       1      3 13265  1196    4221   6404              507       1788
## 5       2      3 22615  5410    7198   3915             1777       5185
## 6       2      3  9413  8259    5126    666             1795       1451

The dataset has 8 columns. The first 2 columns hold the Channel and Region of each customer. The remaining 6 columns hold the annual spending on each product category.

Data Exploration :

In this section we will explore our data to understand the features and their relation with each other.

The dataset has six product categories: ‘Fresh’, ‘Milk’, ‘Grocery’, ‘Frozen’, ‘Detergents_Paper’, and ‘Delicassen’ (the dataset’s spelling of Delicatessen). Let’s view the statistical summary for each of these product categories.

# Summary of product categories

summary(wsCustomer[3:8])
##      Fresh             Milk          Grocery          Frozen       
##  Min.   :     3   Min.   :   55   Min.   :    3   Min.   :   25.0  
##  1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153   1st Qu.:  742.2  
##  Median :  8504   Median : 3627   Median : 4756   Median : 1526.0  
##  Mean   : 12000   Mean   : 5796   Mean   : 7951   Mean   : 3071.9  
##  3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656   3rd Qu.: 3554.2  
##  Max.   :112151   Max.   :73498   Max.   :92780   Max.   :60869.0  
##  Detergents_Paper    Delicassen     
##  Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  256.8   1st Qu.:  408.2  
##  Median :  816.5   Median :  965.5  
##  Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :40827.0   Max.   :47943.0

The summary shows a large difference between the minimum and maximum spending in every category, with means well above the medians. This gives an initial hint that the distributor has both low-spending and high-spending clients.

Let’s quickly check for missing data.

sum(is.na(wsCustomer))
## [1] 0

There is no missing data in our dataset.

Data Preprocessing

Next we preprocess the data by replacing the Region and Channel codes with their names, and then scaling the spending columns.

As per the data description, the Channel and Region columns are coded as below:
Channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel
Frequency: Horeca 298, Retail 142

Region - Lisbon, Oporto or Other
Frequency: Lisbon 77, Oporto 47, Other Region 316

We will replace the values in these columns with the respective names so that we can later analyse which Regions and Channels fall into a particular cluster.

# Convert the Channel and Region columns, replacing numeric codes with names

wsCustomer <- wsCustomer %>%
  mutate(Channel = ifelse(Channel == 1, "HoReCa", "Retail"),
         Region = case_when(Region == 1 ~ "Lisbon",
                            Region == 2 ~ "Oporto",
                            Region == 3 ~ "Others"))

head(wsCustomer)
## # A tibble: 6 x 8
##   Channel Region Fresh  Milk Grocery Frozen Detergents_Paper Delicassen
##   <chr>   <chr>  <dbl> <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1 Retail  Others 12669  9656    7561    214             2674       1338
## 2 Retail  Others  7057  9810    9568   1762             3293       1776
## 3 Retail  Others  6353  8808    7684   2405             3516       7844
## 4 HoReCa  Others 13265  1196    4221   6404              507       1788
## 5 Retail  Others 22615  5410    7198   3915             1777       5185
## 6 Retail  Others  9413  8259    5126    666             1795       1451

As we don’t want the clustering algorithm to depend on arbitrary variable units, we start by scaling/standardizing the data using the R function scale:

wsCustomerSc <- as_tibble(scale(wsCustomer[3:8])) 

head(wsCustomerSc)
## # A tibble: 6 x 6
##     Fresh    Milk Grocery Frozen Detergents_Paper Delicassen
##     <dbl>   <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1  0.0529  0.523  -0.0411 -0.589          -0.0435    -0.0663
## 2 -0.391   0.544   0.170  -0.270           0.0863     0.0890
## 3 -0.447   0.408  -0.0281 -0.137           0.133      2.24  
## 4  0.1000 -0.623  -0.393   0.686          -0.498      0.0933
## 5  0.839  -0.0523 -0.0793  0.174          -0.232      1.30  
## 6 -0.205   0.334  -0.297  -0.496          -0.228     -0.0262

For now the Channel and Region columns are excluded, as they do not carry spending information.
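To make the effect of standardizing concrete, here is a minimal sketch on a tiny made-up matrix (the name `toy` and its values are illustrative and not taken from the dataset). It checks that scale subtracts each column's mean and divides by its standard deviation.

```r
# Toy spending matrix; the values are made up for illustration
toy <- matrix(c(100, 200, 300,
                 10,  40,  70),
              ncol = 2,
              dimnames = list(NULL, c("Fresh", "Milk")))

toy_scaled <- scale(toy)

# Manual z-score: (x - column mean) / column standard deviation
manual <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")

# Both approaches agree; every scaled column has mean 0 and sd 1
all.equal(as.vector(toy_scaled), as.vector(manual))
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

After this step a difference of one unit means "one standard deviation" in every column, so no single product category dominates the distance calculation just because its spending values are larger.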

Clustering :

In this section, we use the K-Means clustering algorithm to identify the various customer segments hidden in the data. We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

K-Means Clustering :

K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It assigns objects to groups (i.e., clusters) such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity).
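As a small illustration of this idea, the sketch below runs kmeans on made-up two-dimensional points (the `toy` data is an assumption for demonstration, not the wholesale data). It shows how the total variation splits into a within-cluster part, which k-means tries to keep small, and a between-cluster part.

```r
set.seed(123)

# Two made-up, well-separated groups of 2-D points (20 points each)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

fit$size                       # number of points in each cluster

# Total sum of squares = within-cluster part + between-cluster part
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation explained by the clustering
fit$betweenss / fit$totss
```

The last ratio is the same quantity that print reports as between_SS / total_SS for a fitted kmeans object.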

Determining Optimal Clusters

First we need to find the optimal number of clusters, as k (the number of clusters) has to be specified before clustering.

The following are three of the most popular methods for determining the optimal number of clusters:

Elbow method
Silhouette method
Gap statistic

Elbow Method

In k-means, the total within-cluster sum of squares (wss) measures the compactness of the clustering, and we want it to be as small as possible.

Thus we can use the following algorithm to choose the number of clusters with the Elbow method:

1) Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2) For each k, calculate the total within-cluster sum of squares (wss).
3) Plot the curve of wss against the number of clusters k.
4) The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
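The four steps above can be sketched by hand in base R. The example uses made-up data with three groups as a stand-in for the scaled spending data; the name `toy` and the choice of nstart = 25 are assumptions for illustration.

```r
set.seed(123)

# Made-up data with three groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 5),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# Steps 1 and 2: fit k-means for k = 1..10 and record the total wss
wss <- sapply(1:10, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Steps 3 and 4: plot wss against k and look for the bend
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Because wss keeps shrinking as k grows, the method looks for the point where adding another cluster stops giving a large improvement, not for the minimum.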

Fortunately, this process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):

set.seed(123)

fviz_nbclust(wsCustomerSc , kmeans, method = "wss")

The elbow graph does not show a sharp bend in this case, but k = 5 is a reasonable candidate.

Silhouette method

The silhouette coefficient of a data point measures how well it fits its assigned cluster compared with the nearest other cluster, on a scale from -1 (poorly matched) to 1 (well matched). Averaging the silhouette coefficient over all points gives a simple score for a given clustering.

A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
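The same procedure can be written out by hand with the silhouette function from the cluster package. This is a minimal sketch on made-up data with two groups; `toy` and the range k = 2 to 6 are illustrative assumptions.

```r
library(cluster)   # provides silhouette()

set.seed(123)

# Made-up data with two groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
d <- dist(toy)     # pairwise Euclidean distances

# Average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

avg_sil
as.integer(names(which.max(avg_sil)))   # k with the highest average width
```

The silhouette is undefined for a single cluster, which is why the range of k starts at 2.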

As with the elbow method, the “average silhouette method” has been wrapped up in a single function (fviz_nbclust):

set.seed(123)

fviz_nbclust(wsCustomerSc, kmeans, method = "silhouette")

The silhouette method suggests that the optimal number of clusters is 2.

Gap Statistic

The gap statistic was published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The approach can be applied to any clustering method.

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value of k that maximizes the gap statistic. This means that the clustering structure is far away from a random uniform distribution of points.

To compute the gap statistic we can use the clusGap function, which returns the gap statistic and its standard error for each k. We can visualize the results with fviz_gap_stat, which suggests the optimal number of clusters.

# compute gap statistic
set.seed(123)
gap_stat <- clusGap(wsCustomerSc, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat)

The gap statistic method suggests that the optimal number of clusters is 3.
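The numbers behind such a plot can also be read directly from the clusGap result. The sketch below uses made-up data and smaller settings (K.max = 5, B = 20) to keep it quick; these values and the name `gap_toy` are assumptions. It compares the plain "largest gap" rule with a standard-error based rule from maxSE, since plotting helpers may apply a standard-error rule instead of the plain maximum.

```r
library(cluster)   # provides clusGap() and maxSE()

set.seed(123)

# Made-up data with two groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))

gap_toy <- clusGap(toy, FUN = kmeans, nstart = 10, K.max = 5, B = 20)

# One row per k; columns are logW, E.logW, gap and SE.sim
tab <- gap_toy$Tab
tab

# Rule 1: k with the largest gap statistic
which.max(tab[, "gap"])

# Rule 2: first local maximum, allowing for one standard error
maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
```

The two rules can disagree, so it is worth checking which one a given plot or summary is using before reading off the suggested k.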

With the above estimates for k, we will compute k-means clustering for k = 2, 3 and 5. We can also view our results by using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster will perform principal component analysis (PCA) and plot the data points according to the first two principal components, which capture the largest share of the variance.
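To see what such a two-dimensional view involves, here is a rough base R sketch of the same idea using prcomp on made-up standardized data (the `toy` matrix is an assumption; this is not how fviz_cluster is implemented internally, only an illustration of projecting clusters onto two principal components).

```r
set.seed(123)

# Made-up standardized data: 50 observations, 4 variables
toy <- scale(matrix(rnorm(200), ncol = 4))

pca <- prcomp(toy)

# Share of the total variance captured by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)
sum(var_share[1:2])            # share shown in a 2-D plot

# Colour the projected points by their k-means cluster
fit <- kmeans(toy, centers = 2, nstart = 10)
plot(pca$x[, 1:2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

If the first two components capture only a modest share of the variance, clusters that overlap in the plot may still be well separated in the full set of variables.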

# Compute Kmeans for K = 2 ,3 and  5
set.seed(123)

k2 <- kmeans(wsCustomerSc, centers = 2, nstart = 30)
k3 <- kmeans(wsCustomerSc, centers = 3, nstart = 30)
k5 <- kmeans(wsCustomerSc, centers = 5, nstart = 30)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = wsCustomerSc) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = wsCustomerSc) + ggtitle("k = 3")
p3 <- fviz_cluster(k5, geom = "point",  data = wsCustomerSc) + ggtitle("k = 5")



grid.arrange(p1, p2, p3, nrow = 2)

The visual assessment above shows that 2 and 3 clusters separate the data into distinct groups, in line with the silhouette and gap statistic results.

Let’s have a look at cluster details in above cases.

print(k2)
## K-means clustering with 2 clusters of sizes 399, 41
## 
## Cluster means:
##         Fresh       Milk    Grocery      Frozen Detergents_Paper
## 1 -0.00542930 -0.2122882 -0.2302493 -0.03310806       -0.2320799
## 2  0.05283636  2.0659269  2.2407190  0.32219794        2.2585338
##   Delicassen
## 1 -0.0826124
## 2  0.8039597
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
## [176] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2
## [211] 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1
## [316] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2
## [351] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [386] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 982.9619 966.3860
##  (between_SS / total_SS =  26.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
print(k3)
## K-means clustering with 3 clusters of sizes 44, 3, 393
## 
## Cluster means:
##          Fresh       Milk    Grocery      Frozen Detergents_Paper
## 1 -0.306069120  1.8098555  2.2038591 -0.24975466        2.2309315
## 2  3.840444845  3.2957757  0.9852919  7.20489292       -0.1527927
## 3  0.004950908 -0.2277887 -0.2542638 -0.02703683       -0.2486071
##    Delicassen
## 1  0.25139852
## 2  6.79967230
## 3 -0.08005229
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 3 3 3
##  [36] 3 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 3
##  [71] 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3
## [176] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3 3 3 1
## [211] 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 1 3 3 1 3 3
## [316] 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1
## [351] 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [386] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 441.0021 214.5396 944.8291
##  (between_SS / total_SS =  39.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

From the above details we can conclude that 3 clusters will be suitable for us, as this solution places the few observations with extreme spending in a separate group. This cluster may contain potential high-spending customers.

Thus we will carry out our final analysis using 3 as the number of clusters.

# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(wsCustomerSc, 3, nstart = 30)
print(final)
## K-means clustering with 3 clusters of sizes 44, 3, 393
## 
## Cluster means:
##          Fresh       Milk    Grocery      Frozen Detergents_Paper
## 1 -0.306069120  1.8098555  2.2038591 -0.24975466        2.2309315
## 2  3.840444845  3.2957757  0.9852919  7.20489292       -0.1527927
## 3  0.004950908 -0.2277887 -0.2542638 -0.02703683       -0.2486071
##    Delicassen
## 1  0.25139852
## 2  6.79967230
## 3 -0.08005229
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 3 3 3
##  [36] 3 3 3 1 3 3 3 3 1 3 1 1 1 3 1 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 3
##  [71] 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3
## [176] 3 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 3 3 3 1
## [211] 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [246] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 1 3 3 1 3 3
## [316] 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1
## [351] 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [386] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3
## [421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 441.0021 214.5396 944.8291
##  (between_SS / total_SS =  39.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
#Visualize final clusters

fviz_cluster(final, geom = "point",  data = wsCustomerSc)

And we can extract the clusters and add to our initial data to do some descriptive statistics at the cluster level:

# Calculate cluster-level means
ClustMeans <- wsCustomer[3:8] %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")

ClustMeans
## # A tibble: 3 x 7
##   Cluster  Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
##     <int>  <dbl>  <dbl>   <dbl>  <dbl>            <dbl>      <dbl>
## 1       1  8129. 19154.  28895.  1859.           13518.      2234.
## 2       2 60572. 30120.  17315. 38049.            2153      20701.
## 3       3 12063.  4115.   5535.  2941.            1696.      1299.
# Visualize customer segments with average product spending  

ClustMeans %>%
  gather(Product, MU, Fresh:Delicassen) %>%
  ggplot(aes(x = Product, y = MU, fill = Product)) +
  geom_col(width = 0.5) +
  facet_grid(. ~ Cluster) +
  scale_fill_brewer(palette = "Accent") +
  ylab("Customer Spending in Monetary Units") +
  ggtitle("Customer Segments with Average Product Spending") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())

The statistics above and their visualization show that customer spending habits vary from segment to segment. We will summarize them in our last section.
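The cluster-level means in original units can also be recovered from the scaled cluster centers, by undoing the standardization. The sketch below demonstrates this on made-up data (the names `raw`, `sc` and `fit` are illustrative assumptions); it relies on the "scaled:center" and "scaled:scale" attributes that scale attaches to its result.

```r
set.seed(123)

# Made-up raw spending for two groups of 30 customers each
raw <- matrix(c(rnorm(30, 100, 10), rnorm(30, 500, 50),   # Fresh
                rnorm(30,  20,  5), rnorm(30,  80, 10)),  # Milk
              ncol = 2, dimnames = list(NULL, c("Fresh", "Milk")))

sc  <- scale(raw)
fit <- kmeans(sc, centers = 2, nstart = 10)

# Undo the scaling: multiply by the column sd, then add the column mean
centers_raw <- sweep(fit$centers, 2, attr(sc, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(sc, "scaled:center"), "+")
centers_raw

# Same numbers as averaging the raw data within each cluster
group_means <- apply(raw, 2, function(col) tapply(col, fit$cluster, mean))
all.equal(unname(centers_raw), unname(group_means))
```

Either route gives the same table, so grouping the original data by cluster label is simply the more readable way to get there.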

Now let’s analyse the Channel and Region distribution in each cluster.

# Channel and Region in Cluster 1/Segment 1

wsCustomer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  filter(Cluster == 1) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 3 x 3
##   Channel Region     n
##   <chr>   <chr>  <int>
## 1 Retail  Lisbon     7
## 2 Retail  Oporto     8
## 3 Retail  Others    29
# Channel and Region in Cluster 2 /Segment 2

wsCustomer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  filter(Cluster == 2) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 2 x 3
##   Channel Region     n
##   <chr>   <chr>  <int>
## 1 HoReCa  Oporto     1
## 2 HoReCa  Others     2
# Channel and Region in Cluster 3/Segment 3

wsCustomer[1:2] %>%
  mutate(Cluster = final$cluster) %>%
  filter(Cluster == 3) %>% ungroup() %>% count(Channel , Region)
## # A tibble: 6 x 3
##   Channel Region     n
##   <chr>   <chr>  <int>
## 1 HoReCa  Lisbon    59
## 2 HoReCa  Oporto    27
## 3 HoReCa  Others   209
## 4 Retail  Lisbon    11
## 5 Retail  Oporto    11
## 6 Retail  Others    76
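The per-cluster counts above can also be produced in a single cross-tabulation. Here is a minimal base R sketch with made-up labels standing in for final$cluster and the Channel column (the vectors `cluster` and `channel` are hypothetical).

```r
# Made-up labels standing in for the cluster assignment and the Channel column
cluster <- c(1, 1, 2, 3, 3, 3, 3, 1)
channel <- c("Retail", "Retail", "HoReCa", "HoReCa",
             "Retail", "HoReCa", "HoReCa", "Retail")

# Counts of each channel within each cluster
counts <- table(Cluster = cluster, Channel = channel)
counts

# Row-wise shares: the channel mix inside each cluster
round(prop.table(counts, margin = 1), 2)
```

Row-wise shares make it easy to see at a glance whether a cluster is dominated by one channel, which is harder to judge from raw counts when cluster sizes differ a lot.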

Conclusion :

From the above analysis we can draw the following conclusions about the spending habits of each segment identified in the clustered data:

Segment 1: This segment contains only Retail customers, who spend mainly on Grocery and Milk, followed by Detergents_Paper and then Fresh products.

Segment 2: This segment contains only a few Hotel/Restaurant/Cafe customers, who spend heavily on Fresh and Frozen, followed by Milk and Grocery. They also have the highest mean spending on Delicassen. These customers form a well-separated group with these spending habits.

Segment 3: This segment consists mostly of Hotel/Restaurant/Cafe customers along with some Retail customers. They spend moderately on Fresh, followed by Grocery and Milk, and have the lowest spending on Detergents_Paper and Delicassen of all groups.

This concludes our customer segmentation exercise using the K-Means algorithm.