Clustering

Introduction

This markdown document walks through an end-to-end clustering project. The dataset is specially selected to showcase different technical aspects of a project and their business value.

The primary objectives are to demonstrate the following:

Explore, understand & cleanse the data with explained reasoning (Exploratory Data Analysis)

Identify, implement and evaluate the best possible statistical model, supported by evidence from the data

Document the work systematically with reproducible R code so that others can understand and improve it

Summarise the observed findings into concise business value and insights through an executive summary

Understand the full data project cycle needed to create a commercial data product, executed through R

Overview

For this project, the use case is to segment wholesale customers based on the items they purchase. The dataset used is the Wholesale Customers dataset from the public domain.

This dataset is chosen because customer segmentation is one of the most common clustering topics. Clustering with unsupervised learning aims to discover groups with high intra-class similarity and high inter-class dissimilarity. The approach is easily reproducible in contexts beyond customer segmentation, such as automobile insurance, credit card users, telco customer churn and ecommerce shoppers.

As the modelling is based on historical data under certain assumptions, this project does not take into account other external factors such as mode of payment (credit card/cash), direct delivery versus self-collection, or the time of the purchase transaction.

Data Dictionary

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)
  2. MILK: annual spending (m.u.) on milk products (Continuous)
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL: customer channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
  8. REGION: customer region - Lisbon, Oporto or Other (Nominal)

Exploratory Data Analysis

library(stats)
library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(corrplot)
## corrplot 0.84 loaded
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
library (cluster)
df.original<-read.csv('Wholesale customers data.csv')   # dataset file assumed to be in the working directory
set.seed(1) # ensure reproducible results
df<- sample_frac(df.original,0.7) # 70% training sample; the remaining 30% is held out as a test set
df.index<- as.numeric(rownames(df))
df.test<- df.original[-df.index,]
head(df)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       1      2 13360  944   11593    915             1679        573
## 2       2      3  4822 6721    9170    993             4973       3637
## 3       1      3   140 8847    3823    142             1062          3
## 4       1      3  5065 5499   11055    364             3485       1063
## 5       2      2  6758 4560    9965    934             4538       1037
## 6       1      1 15218  258    1138   2516              333        204
str(df)
## 'data.frame':    308 obs. of  8 variables:
##  $ Channel         : int  1 2 1 1 2 1 1 2 2 1 ...
##  $ Region          : int  2 3 3 3 2 1 3 2 3 3 ...
##  $ Fresh           : int  13360 4822 140 5065 6758 15218 3009 6468 11867 27901 ...
##  $ Milk            : int  944 6721 8847 5499 4560 258 521 12867 3327 3749 ...
##  $ Grocery         : int  11593 9170 3823 11055 9965 1138 854 21570 4814 6964 ...
##  $ Frozen          : int  915 993 142 364 934 2516 3470 1840 1178 4479 ...
##  $ Detergents_Paper: int  1679 4973 1062 3485 4538 333 949 7558 3837 603 ...
##  $ Delicassen      : int  573 3637 3 1063 1037 204 727 1543 120 2503 ...

We read through the data types in the data frame (df) to ensure the structure does not need any changes.

summary(df)
##     Channel          Region          Fresh            Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :    3   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.: 3124   1st Qu.: 1490  
##  Median :1.000   Median :3.000   Median : 8873   Median : 3364  
##  Mean   :1.315   Mean   :2.529   Mean   :11950   Mean   : 5764  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:16851   3rd Qu.: 7175  
##  Max.   :2.000   Max.   :3.000   Max.   :76237   Max.   :54259  
##     Grocery          Frozen        Detergents_Paper   Delicassen     
##  Min.   :  137   Min.   :   25.0   Min.   :    3    Min.   :    3.0  
##  1st Qu.: 2212   1st Qu.:  684.2   1st Qu.:  287    1st Qu.:  411.8  
##  Median : 4684   Median : 1498.0   Median :  811    Median :  923.5  
##  Mean   : 7895   Mean   : 3022.7   Mean   : 2731    Mean   : 1609.5  
##  3rd Qu.:10732   3rd Qu.: 3398.0   3rd Qu.: 3851    3rd Qu.: 1858.8  
##  Max.   :92780   Max.   :60869.0   Max.   :40827    Max.   :47943.0

A quick look at summary statistics for the variables. From here, we can conclude the following:

  1. Only the product categories make sense in terms of summary statistics; Channel and Region are nominal codes.
  2. There is no need to scale at this point, as the product categories that will be used for clustering share the same unit of measurement.

summary(is.na(df))
##   Channel          Region          Fresh            Milk        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:308       FALSE:308       FALSE:308       FALSE:308      
##   Grocery          Frozen        Detergents_Paper Delicassen     
##  Mode :logical   Mode :logical   Mode :logical    Mode :logical  
##  FALSE:308       FALSE:308       FALSE:308        FALSE:308

Check whether there are any missing values (NA) in each variable. There are 308 observations in the training sample (440 in the full dataset) and all variables show FALSE for missing values, so we can safely conclude this dataset is clean. However, if there were NA values, the following are the usual approaches to tackle the problem (a minimal sketch of each follows the list):

  1. Exclude them inside a calculation with na.rm=TRUE (generally used on vectors).
  2. Use a pre-defined default value based on domain knowledge; usual suspects are 0, -1 or 9999999999.
  3. Recode the missing value with the mean of that variable.
  4. Omit the records with missing values using na.omit(df), which keeps a subset of the dataset.
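A minimal sketch of the four approaches, using a small hypothetical data frame d with missing values in a column x (names chosen purely for illustration):

# Hypothetical data frame with missing values
d <- data.frame(x = c(5, NA, 12, 7, NA))

# 1. Exclude NAs inside a calculation (works on vectors)
mean(d$x, na.rm = TRUE)

# 2. Replace with a pre-defined default value (e.g. 0)
d1 <- d
d1$x[is.na(d1$x)] <- 0

# 3. Recode missing values with the mean of the variable
d2 <- d
d2$x[is.na(d2$x)] <- mean(d2$x, na.rm = TRUE)

# 4. Drop the records that contain missing values
d3 <- na.omit(d)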

table(df$Channel)
## 
##   1   2 
## 211  97

Next, the frequencies of the categorical variables are observed. In the training sample, 211 observations purchased via the Horeca channel (Hotel/Restaurant/Café) and 97 via the Retail channel.

table(df$Region)
## 
##   1   2   3 
##  56  33 219

For Region, there are 56 observations from Lisbon, 33 from Oporto and 219 from other regions.

df %>% 
  group_by(Channel, Region) %>%                 # multiple group columns
  summarise(total_fresh = sum(Fresh),           # multiple summary columns
            total_Milk = sum(Milk),
            total_Grocery = sum(Grocery),
            total_Frozen = sum(Frozen),
            total_Detergents_Paper = sum(Detergents_Paper),
            total_Delicassen = sum(Delicassen))
## `summarise()` regrouping output by 'Channel' (override with `.groups` argument)
## # A tibble: 6 x 8
## # Groups:   Channel [2]
##   Channel Region total_fresh total_Milk total_Grocery total_Frozen
##     <int>  <int>       <int>      <int>         <int>        <int>
## 1       1      1     3680600    1775418       2431602       930988
## 2       1      2     3680600    1775418       2431602       930988
## 3       1      3     3680600    1775418       2431602       930988
## 4       2      1     3680600    1775418       2431602       930988
## 5       2      2     3680600    1775418       2431602       930988
## 6       2      3     3680600    1775418       2431602       930988
## # ... with 2 more variables: total_Detergents_Paper <int>,
## #   total_Delicassen <int>

The table above shows the total annual spend for each product category, broken down by Channel and Region. Overall, the Fresh and Grocery categories have the highest spend.

temp <- reshape(df, direction = "long",
                varying = c("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen"),
                v.names = "Total_price", timevar = "Category",
                time = c("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen"))

ggplot(temp, aes(x = Category, y = Total_price)) +
  geom_boxplot() +
  stat_boxplot(geom = 'errorbar') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Product Category Distribution")

The above plot shows the following:

  1. There are quite a few outliers in each category. Two possible approaches: (a) remove the outliers and run the k-means algorithm, or (b) keep the outliers and run a k-medoids (Partitioning Around Medoids, PAM) algorithm with Manhattan distance; a sketch of approach (b) follows this list.
  2. There is higher variance in spend for the Fresh and Grocery categories.
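A minimal sketch of approach (b), using the cluster package already loaded above; k = 2 is assumed here purely for illustration and the object name pam.fit is not part of the original analysis:

# k-medoids (PAM) with Manhattan distance is less sensitive to outliers,
# so it can be run on the product categories without removing outliers.
pam.fit <- pam(df[, -c(1:2)], k = 2, metric = "manhattan", stand = TRUE)
pam.fit$medoids             # the representative (medoid) observations
table(pam.fit$clustering)   # cluster sizes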

Next, we examine the pairwise relationships and correlations between the product categories.

cor.result <- cor(df[, -c(1:2)])   # correlations among the product categories only
pairs(df[, -c(1:2)], col = df$Channel, pch = 19, lower.panel = NULL, main = "Category by Channel")
pairs(df[, -c(1:2)], col = df$Region, pch = 19, lower.panel = NULL, main = "Category by Region")
corrplot(cor.result, method = "ellipse", title = "Correlation by category", mar = c(0, 0, 1, 0))

The above plots show the following:

  1. Detergents_Paper and Grocery have the highest correlation.

Next, we check which variables have outliers.

# Exclude the Region & Channel columns as they are nominal, not spending amounts
apply(X= df[,-c(1:2)],MARGIN=2, FUN = function(x)length(boxplot.stats(x)$out))
##            Fresh             Milk          Grocery           Frozen 
##               14               21               16               35 
## Detergents_Paper       Delicassen 
##               19               21

The numbers above are the counts of outliers per category. To treat the outliers, a winsorizing technique will be used: outliers are replaced by a specific percentile of the data, typically the 90th or 95th. First, we sort the outlier values for each category.

sort(boxplot.stats(df$Grocery)$out)
##  [1] 23596 23998 24773 25957 26870 28540 28921 28986 30243 32034 33586 36486
## [13] 39694 55571 59598 92780

Next, examine the 90th, 95th, … percentiles.

quantile(df$Grocery, probs=seq(from =0.9, to=1,by=0.025))
##      90%    92.5%      95%    97.5%     100% 
## 17938.10 20288.95 23140.30 29394.53 92780.00

From the above, the 95th percentile is selected because of the large jump in values beyond it. Next, values above the 95th percentile are capped at that value.

grocery.max <- as.numeric(quantile(df$Grocery,probs=0.95))
df$Grocery[df$Grocery > grocery.max] <- grocery.max

The same approach is applied to the Detergents_Paper category.

sort(boxplot.stats(df$Detergents_Paper)$out)
##  [1]  9265  9529  9836 10069 11577 11783 12034 12408 12638 13308 13583 13726
## [13] 14841 18594 18906 19410 24171 26701 40827
quantile(df$Detergents_Paper, probs=seq(from =0.9, to=1,by=0.025))
##      90%    92.5%      95%    97.5%     100% 
##  7243.00  8680.05  9987.45 13629.48 40827.00
detergents.max <- as.numeric(quantile(df$Detergents_Paper, probs = 0.925))
df$Detergents_Paper[df$Detergents_Paper > detergents.max] <- detergents.max
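The same capping logic can be wrapped in a small reusable helper; a minimal sketch (the function name and the default percentile are illustrative choices, not part of the original analysis):

# Cap values above a chosen percentile (simple one-sided winsorizing)
winsorize_upper <- function(x, prob = 0.95) {
  cap <- as.numeric(quantile(x, probs = prob))
  x[x > cap] <- cap
  x
}

# Equivalent to the manual steps above, e.g.:
# df$Grocery <- winsorize_upper(df$Grocery, prob = 0.95)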

For this project, only these two variables will be used for clustering, for simplicity.

ggplot(data=df, aes(x=Grocery, y =Detergents_Paper)) + geom_point(shape=1) +geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

Create subset to run k means.

df.subset1<-as.data.frame(df[,c("Grocery","Detergents_Paper")])
summary(df.subset1)
##     Grocery      Detergents_Paper
##  Min.   :  137   Min.   :   3    
##  1st Qu.: 2212   1st Qu.: 287    
##  Median : 4684   Median : 811    
##  Mean   : 7176   Mean   :2280    
##  3rd Qu.:10732   3rd Qu.:3851    
##  Max.   :23140   Max.   :8680

For the case above, normalisation or scaling is required as the values are on different scales of magnitude.

df.subset1<- as.data.frame(scale(df.subset1))
summary(df.subset1)
##     Grocery        Detergents_Paper 
##  Min.   :-1.0898   Min.   :-0.8228  
##  1st Qu.:-0.7685   1st Qu.:-0.7201  
##  Median :-0.3859   Median :-0.5308  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5506   3rd Qu.: 0.5675  
##  Max.   : 2.4715   Max.   : 2.3124

Analysis & Modeling

Tests to find the optimal number of clusters

After scaling, the following approaches are run to find the optimal number of clusters:

  1. Elbow method
  2. Average silhouette method
  3. Gap statistic method

For illustration purposes, all three tests are run below.

set.seed(102)
# Elbow method
fviz_nbclust(df.subset1, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

# Silhouette method
fviz_nbclust(df.subset1, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

# Gap statistic
# nboot = 50 to keep the function speedy. 
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.subset1, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

Two clusters are selected based on the tests above (2 of the 3 tests agree). Next, we visualise the clusters, with a triangle marking each centroid.

K-means Clustering

set.seed(111)
kmean2.simple <- kmeans(df.subset1,centers=2, iter.max = 25, nstart=100)
df.subset1$cluster <- factor(kmean2.simple$cluster)
summary(df.subset1)
##     Grocery        Detergents_Paper  cluster
##  Min.   :-1.0898   Min.   :-0.8228   1: 71  
##  1st Qu.:-0.7685   1st Qu.:-0.7201   2:237  
##  Median :-0.3859   Median :-0.5308          
##  Mean   : 0.0000   Mean   : 0.0000          
##  3rd Qu.: 0.5506   3rd Qu.: 0.5675          
##  Max.   : 2.4715   Max.   : 2.3124
ggplot(data=df.subset1, aes(x=Detergents_Paper, y=Grocery, colour=cluster))+geom_point()+geom_point(data=as.data.frame(kmean2.simple$centers), color ="black", size=4, shape =17)

## Evaluation & Interpretation Above plot illustrate 2 clusters based on the 2 variables, Grocery and Detergents_paper based on k means algorithm. Next, examine the summary of the k mean object result.

Clustering validation is done next.

  1. Internal cluster validation, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can also be used for estimating the number of clusters and the appropriate clustering algorithm without any external data. Examples: silhouette width, Dunn index.

  2. External cluster validation, which consists of comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match the externally supplied class labels. Since the "true" cluster number is known in advance, this approach is mainly used for selecting the right clustering algorithm for a specific data set. Examples: adjusted Rand measure (varies from -1, no agreement, to 1, perfect agreement), Meila's variation of information (VI) index, Jaccard index.

  3. Relative cluster validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g. varying the number of clusters k). It is generally used for determining the optimal number of clusters. Examples: elbow method, silhouette method, gap statistic method.

Internal Cluster Validation

D<- daisy(df.subset1)
plot(silhouette(kmean2.simple$cluster, D),col=1:2, border = NA)

The above plot shows the following:

  1. Cluster 1: average silhouette width 0.77 (good, as it is close to 1)
  2. Cluster 2: average silhouette width 0.88 (good, as it is close to 1)

The internal cluster quality is therefore good, as shown in the plot.

External Cluster Validation

set.seed(111)

library("fpc")
# Compute cluster stats. Note: no true external labels exist here, so the
# k-means labels are compared with themselves (purely for illustration).
labels <- as.numeric(kmean2.simple$cluster)
clust_stats <- cluster.stats(d = dist(df),
                             labels, kmean2.simple$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 1
# Meila’s variation index VI
clust_stats$vi
## [1] 0

For illustration's sake, the result above shows external cluster validation using the adjusted Rand index and Meila's variation of information (VI) index. Note that no genuinely external class labels were used, since the true grouping is not known beforehand; the cluster labels were compared with themselves, which is why the agreement is perfect (something that would not happen with real external labels).

The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement).

Corrected Rand index: 1 (closer to 1 is better). VI: 0 (lower is better).
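The same clust_stats object also exposes internal validation indices, including the Dunn index and average silhouette width mentioned earlier (both computed here on dist(df), as above); a quick check:

clust_stats$dunn           # Dunn index: higher values indicate better separation
clust_stats$avg.silwidth   # average silhouette width across all observations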

set.seed(111)
kmean2.simple
## K-means clustering with 2 clusters of sizes 71, 237
## 
## Cluster means:
##      Grocery Detergents_Paper
## 1  1.5282872        1.5779677
## 2 -0.4578413       -0.4727245
## 
## Clustering vector:
##   [1] 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2
##  [38] 2 2 1 1 2 2 1 2 1 2 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2
##  [75] 1 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 2 1 2 2 2 2 2 1 2 2 1
## [112] 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2
## [149] 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2 2 1
## [186] 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1
## [260] 2 1 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2
## [297] 2 2 2 2 2 1 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 74.39923 94.33843
##  (between_SS / total_SS =  72.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# checking betweenss, i.e. the between-cluster sum of squares
kmean2.simple$betweenss
## [1] 445.2623
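The held-out df.test created at the start has not been used in the clustering above. A minimal sketch of how new observations could be assigned to the learned clusters by nearest centroid, re-using the training set's centring and scaling (the object names test.scaled and test.cluster are illustrative, and the test set has not been winsorized here):

# Scale the test observations with the training set's mean and standard deviation
train.raw   <- df[, c("Grocery", "Detergents_Paper")]
test.raw    <- df.test[, c("Grocery", "Detergents_Paper")]
test.scaled <- scale(test.raw,
                     center = sapply(train.raw, mean),
                     scale  = sapply(train.raw, sd))

# Assign each test observation to the nearest k-means centroid
test.cluster <- apply(test.scaled, 1, function(x)
  which.min(colSums((t(kmean2.simple$centers) - x)^2)))
table(test.cluster)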

Findings

Summary findings on the clusters:

  1. The clusters contain 71 and 237 observations respectively, a ratio of roughly 23% to 77%.
  2. Cluster 1's average spend is higher than cluster 2's, based on the centroid means.
  3. Intra-cluster cohesion (within-cluster sum of squares): cluster 1: 74.39923, cluster 2: 94.33843.
  4. Goodness of the k-means classification (between_SS / total_SS): 72.5% (a good fit).
  5. Between-cluster sum of squares: 445.2623.

These figures can also be read directly from the k-means object, as sketched below.
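A short sketch of pulling the same figures from the fitted object rather than the printed output:

kmean2.simple$size                              # observations per cluster
kmean2.simple$withinss                          # within-cluster sum of squares per cluster
kmean2.simple$betweenss                         # between-cluster sum of squares
kmean2.simple$betweenss / kmean2.simple$totss   # goodness of fit (between_SS / total_SS)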

Reference

Clustering

http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/

http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#elbow-method

https://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/

https://stackoverflow.com/questions/32570693/make-silhouette-plot-legible-for-k-means

https://www.guru99.com/r-k-means-clustering.html

http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/

https://www.r-bloggers.com/k-means-clustering-in-r/

http://cowlet.org/2013/12/23/understanding-data-science-clustering-with-k-means-in-r.html

https://www.edureka.co/blog/k-means-clustering-algorithm/

http://www.sthda.com/english/wiki/print.php?id=241#clustering-validation-statistics

https://www.edureka.co/blog/clustering-on-bank-data-using-r/