Introduction

This markdown document is one of four data science (DS) projects. Each open dataset was specially selected to showcase different technical aspects of a project and their business value.

The primary objectives are to demonstrate the following:

  1. Explore, understand & cleanse the data with explained reasoning (Exploratory Data Analysis)

  2. Identify, implement & evaluate the best possible statistical model based on evidence from the data

  3. Document systematically, with reproducible R code, so that others can understand & improve on the work

  4. Summarise the observed findings with concise business value & insights in an executive summary

  5. Understand the data project cycle end to end, towards creating a commercial data product by executing statistics through an R application

What this project does not cover

  1. A data product via dashboard: visualising the processed output data through visualisation tools such as Qlik or Tableau - though I am trained in this ;)

  2. Deep explanation of the business domain knowledge behind the dataset

  3. External dataset(s) to enrich the data or add other dimensions for tackling the business question

  4. Working style or collaboration methods with other data team members (ranging from data analyst, business analyst, data scientist, data engineer, visualisation engineer and business stakeholder to project manager)

  5. Other statistical models which might perform better or could be used as alternatives.

Reasons:

  1. There is limited time & scope for executing this project.
  2. The essential goal is to meet the primary objectives stated above, not to showcase the breadth of techniques.
  3. It is acknowledged that there is always a better or alternative approach to test the stated hypothesis.

All data sources used are taken from the public domain and the projects are implemented using RStudio. For the full Markdown version, kindly contact me via teochunwey@gmail.com

Content

  1. Overview
  2. Data Dictionary
  3. Exploratory Data Analysis
  4. Analysis & Modeling
  5. Evaluation & Interpretation
  6. Findings & Executive Summary

Overview

For this project, the use case is to find segment groups of wholesale customers based on the items they purchase. The dataset used is the wholesale customers dataset from the public domain.

This dataset was chosen because clustering customers into segments is one of the most common use cases. Clustering observations on their variables/features with unsupervised learning aims to discover groups with high intra-cluster similarity and high inter-cluster dissimilarity. The approach is easily reproducible in contexts beside customer segmentation, such as automobile insurance, credit card usage, telco customer churn and e-commerce shoppers.

As this modelling is based on historical data under certain assumptions, the project does not take into account other external factors such as mode of purchase (credit card/cash), direct delivery versus self-collection, or time of the purchase transaction.

Data Dictionary

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)
  2. MILK: annual spending (m.u.) on milk products (Continuous)
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL: customer channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
  8. REGION: customer region - Lisbon, Oporto or Other (Nominal)

Note: amounts are in monetary units (m.u.)

Data Source: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

Exploratory Data Analysis

library(stats)
library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(corrplot)
## corrplot 0.84 loaded
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(NbClust)
library (cluster)
df.original <- read.csv('Wholesale customers data.csv')
set.seed(1) # Ensure reproducible sampling
df <- sample_frac(df.original, 0.7) # split into train and test data in a 7:3 ratio
df.index <- as.numeric(rownames(df))
df.test <- df.original[-df.index,]
head(df)
##     Channel Region Fresh  Milk Grocery Frozen Detergents_Paper Delicassen
## 117       1      3 11173  2521    3355   1517              310        222
## 164       2      3  5531 15726   26870   2367            13726        446
## 251       1      1  3191  1993    1799   1730              234        710
## 397       2      3  4515 11991    9345   2644             3378       2213
## 88        1      3 43265  5025    8117   6312             1579      14351
## 391       1      3  3352  1181    1328   5502              311       1000

Look at the sample records in the dataset and compare them against the data dictionary.

  1. Ensure the data dictionary information matches the dataset.
  2. Understand what each column means.

There are 6 product categories with annual spending by the wholesale customers, grouped by Channel and Region.

str(df)
## 'data.frame':    308 obs. of  8 variables:
##  $ Channel         : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ Region          : int  3 3 1 3 3 3 3 3 1 3 ...
##  $ Fresh           : int  11173 5531 3191 4515 43265 3352 8708 7149 2083 9898 ...
##  $ Milk            : int  2521 15726 1993 11991 5025 1181 3634 2247 5007 961 ...
##  $ Grocery         : int  3355 26870 1799 9345 8117 1328 6100 1242 1563 2861 ...
##  $ Frozen          : int  1517 2367 1730 2644 6312 5502 2349 1619 1120 3151 ...
##  $ Detergents_Paper: int  310 13726 234 3378 1579 311 2123 1226 147 242 ...
##  $ Delicassen      : int  222 446 710 2213 14351 1000 5137 128 1550 833 ...

Read through the data types in the data frame (df) to confirm the structure does not need any changes.
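Channel and Region are stored as integers although the data dictionary lists them as nominal. They are left as integers for the clustering work below, but a minimal sketch of the optional conversion to labelled factors is shown here (the integer-to-label mapping is assumed from the UCI documentation, not read from the file itself):

# Optional: treat Channel and Region as labelled factors for readability
df.labelled <- df
df.labelled$Channel <- factor(df.labelled$Channel, levels = c(1, 2),
                              labels = c("Horeca", "Retail"))
df.labelled$Region  <- factor(df.labelled$Region, levels = c(1, 2, 3),
                              labels = c("Lisbon", "Oporto", "Other"))
str(df.labelled[, c("Channel", "Region")])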

summary(df)
##     Channel          Region          Fresh             Milk      
##  Min.   :1.000   Min.   :1.000   Min.   :    18   Min.   :   55  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  3094   1st Qu.: 1610  
##  Median :1.000   Median :3.000   Median :  8130   Median : 3684  
##  Mean   :1.331   Mean   :2.513   Mean   : 12059   Mean   : 5716  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.: 16851   3rd Qu.: 7119  
##  Max.   :2.000   Max.   :3.000   Max.   :112151   Max.   :73498  
##     Grocery          Frozen        Detergents_Paper    Delicassen     
##  Min.   :    3   Min.   :   33.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.: 2156   1st Qu.:  805.8   1st Qu.:  282.0   1st Qu.:  395.0  
##  Median : 4904   Median : 1619.0   Median :  824.5   Median :  944.5  
##  Mean   : 7729   Mean   : 3089.2   Mean   : 2764.6   Mean   : 1558.2  
##  3rd Qu.:10550   3rd Qu.: 3532.5   3rd Qu.: 4003.2   3rd Qu.: 1820.2  
##  Max.   :59598   Max.   :60869.0   Max.   :26701.0   Max.   :47943.0

A quick look at summary statistics for the variables. From here, we can conclude the following:

  1. Only the product categories are meaningful in terms of summary statistics (Channel and Region are nominal codes).
  2. There is no need to scale at this point, as the unit of measurement is the same across the product categories that will be used for clustering.
summary(is.na(df))
##   Channel          Region          Fresh            Milk        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:308       FALSE:308       FALSE:308       FALSE:308      
##   Grocery          Frozen        Detergents_Paper Delicassen     
##  Mode :logical   Mode :logical   Mode :logical    Mode :logical  
##  FALSE:308       FALSE:308       FALSE:308        FALSE:308

Check whether there are any missing values (NA) in each variable. There are 308 observations in the training sample and every variable shows FALSE for missing values, so we can safely conclude this dataset is already quite clean. If there were NA values, the usual approaches to tackle the problem are the following (a minimal sketch follows the list):

  1. Exclude them in vector calculations with na.rm = TRUE
  2. Use a pre-defined default value based on domain knowledge - usual suspects are 0, -1 or 9999999999
  3. Recode the missing value with the mean value of that variable
  4. Omit the records with missing values using na.omit(df), leaving a subset of the dataset
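A minimal sketch of these approaches on a small hypothetical data frame (df.na is made up for illustration; the wholesale dataset itself has no missing values):

# Hypothetical data frame with missing values, for illustration only
df.na <- data.frame(Fresh = c(100, NA, 300), Milk = c(50, 60, NA))

# 1. Exclude NAs within a vector calculation
mean(df.na$Fresh, na.rm = TRUE)

# 2. Replace NAs with a pre-defined default value
df.default <- df.na
df.default[is.na(df.default)] <- 0

# 3. Recode NAs with the mean of that variable
df.mean <- df.na
df.mean$Fresh[is.na(df.mean$Fresh)] <- mean(df.mean$Fresh, na.rm = TRUE)

# 4. Omit the records with missing values
df.complete <- na.omit(df.na)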
table(df$Channel)
## 
##   1   2 
## 206 102

Next, the frequencies of the categorical variables are observed. In the training sample there are 206 observations that purchased via Horeca (Hotel/Restaurant/Café) and 102 via Retail.

table(df$Region)
## 
##   1   2   3 
##  57  36 215

For Region, there are 57 transactions from Lisbon, 36 from Oporto and 215 from other regions in the training sample.

df %>% 
  group_by(Channel, Region) %>%                           # multiple group columns
  summarise(total_fresh = sum(Fresh), total_Milk = sum(Milk), total_Grocery = sum(Grocery),
            total_Frozen = sum(Frozen), total_Detergents_Paper = sum(Detergents_Paper),
            total_Delicassen = sum(Delicassen))           # multiple summary columns; use bare column names so the sums respect the grouping
## # A tibble: 6 x 8
## # Groups:   Channel [?]
##   Channel Region total_fresh total_Milk total_Grocery total_Frozen
##     <int>  <int>       <int>      <int>         <int>        <int>
## 1       1      1     3714211    1760547       2380616       951463
## 2       1      2     3714211    1760547       2380616       951463
## 3       1      3     3714211    1760547       2380616       951463
## 4       2      1     3714211    1760547       2380616       951463
## 5       2      2     3714211    1760547       2380616       951463
## 6       2      3     3714211    1760547       2380616       951463
## # ... with 2 more variables: total_Detergents_Paper <int>,
## #   total_Delicassen <int>

The intent above is to total the spending per product category by Channel and Region. Note that inside summarise() the columns must be referenced without the df$ prefix; otherwise the ungrouped category totals are simply repeated for every group, which is what the output above shows. Overall, Fresh and Grocery are the top categories by total spend.

temp <- reshape(df, direction="long",
                varying=c("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen"),
                v.names="Total_price", timevar="Category",
                time=c("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen"))

ggplot(temp, aes(x=Category, y=Total_price)) +
  geom_boxplot() +
  stat_boxplot(geom='errorbar') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  ggtitle("Product Category Distribution")

# if faceting is needed -> facet_wrap(~Category, ncol=3)

The plot above shows the following:

  1. There are quite a few outliers in the categories. Two approaches can handle them (see the PAM sketch after this list):
    1. Remove the outliers and run the k-means algorithm
    2. Keep the outliers and run k-medoids - Partitioning Around Medoids (PAM) - with the Manhattan distance
  2. Higher variance in spending for Fresh and Grocery.
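For reference, a minimal sketch of the PAM alternative mentioned above, using the cluster package loaded earlier; k = 2 and the Manhattan distance are illustrative assumptions, not results from this project:

# Sketch only: PAM uses medoids (actual observations) and is more robust
# to outliers than k-means, so the raw spending columns can be used directly.
pam.fit <- pam(scale(df[, -c(1:2)]), k = 2, metric = "manhattan")
pam.fit$medoids          # representative observations instead of centroids
table(pam.fit$clustering)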

Next, we look at pairwise relationships between the product categories, coloured by Channel and by Region, together with the correlations between categories.

cor.result <- cor(df)
pairs(df[,-c(1:2)], col=df$Channel, pch=19, lower.panel=NULL, main="Category by Channel")

plot(df[,-c(1:2)], col=df$Region, pch=19, lower.panel=NULL, main="Category by Region")

corrplot(cor.result, method="ellipse", title="Correlation by category", mar=c(0,0,1,0))

The plots above show the following:

  1. Detergents_Paper and Grocery have the highest correlation (a quick numeric check follows).
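A quick numeric check of that pair, and of the full category correlation matrix:

cor(df$Grocery, df$Detergents_Paper)
round(cor(df[, -c(1:2)]), 2)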

Next, we check which variables have outliers.

# Exclude the Region & Channel columns as they are nominal codes, not spending values
apply(X = df[,-c(1:2)], MARGIN = 2, FUN = function(x) length(boxplot.stats(x)$out))
##            Fresh             Milk          Grocery           Frozen 
##               14               20               17               29 
## Detergents_Paper       Delicassen 
##               21               18

The numbers shown above are the counts of outliers per category. To handle the outliers, the Winsorizing technique will be used: the outlying values are replaced by a specific percentile of the data, normally the 90th or 95th. First, we sort the outlying values for each category.

sort(boxplot.stats(df$Grocery)$out)
##  [1] 23596 23998 24708 24773 25957 26839 26866 26870 28540 28921 28986
## [12] 32114 33586 34792 36486 39694 59598

Next, examine the 90th, 95th, … percentiles.

quantile(df$Grocery, probs=seq(from =0.9, to=1,by=0.025))
##      90%    92.5%      95%    97.5%     100% 
## 19258.40 21198.98 23857.30 28663.83 59598.00

From the above, the 95th percentile is selected because of the large jump between the 95th and 97.5th percentiles. Values above the 95th percentile are then capped at that value.

grocery.max <- as.numeric(quantile(df$Grocery,probs=0.95))
df$Grocery[df$Grocery > grocery.max] <- grocery.max

The same treatment is applied to the Detergents_Paper category (here the 92.5th percentile is used as the cap).

sort(boxplot.stats(df$Detergents_Paper)$out)
##  [1]  9606  9836  9959 10069 11577 11783 12034 12218 12420 12591 12638
## [12] 13308 13583 13726 14235 17120 17740 18594 19410 20070 26701
quantile(df$Detergents_Paper, probs=seq(from =0.9, to=1,by=0.025))
##       90%     92.5%       95%     97.5%      100% 
##  7464.900  8926.725 11710.900 13629.475 26701.000
detergents.max <- as.numeric(quantile(df$Detergents_Paper, probs=0.925))
df$Detergents_Paper[df$Detergents_Paper > detergents.max] <- detergents.max
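The manual capping above could be wrapped in a small helper so the same treatment can be reused on other categories; a minimal sketch (the percentile defaults are illustrative):

# Winsorize the upper tail: cap values above a chosen percentile
winsorize_upper <- function(x, probs = 0.95) {
  cap <- as.numeric(quantile(x, probs = probs))
  x[x > cap] <- cap
  x
}

# Equivalent to the manual steps above (not re-run here, as df is already capped)
# df$Grocery          <- winsorize_upper(df$Grocery, probs = 0.95)
# df$Detergents_Paper <- winsorize_upper(df$Detergents_Paper, probs = 0.925)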

For this project, only these two variables (Grocery and Detergents_Paper) are used for clustering, to keep things simple.

ggplot(data=df, aes(x=Grocery, y =Detergents_Paper)) + geom_point(shape=1) +geom_smooth(method="lm")

Create a subset of the two variables to run k-means on.

df.subset1<-as.data.frame(df[,c("Grocery","Detergents_Paper")])
summary(df.subset1)
##     Grocery      Detergents_Paper
##  Min.   :    3   Min.   :   3.0  
##  1st Qu.: 2156   1st Qu.: 282.0  
##  Median : 4904   Median : 824.5  
##  Mean   : 7336   Mean   :2398.7  
##  3rd Qu.:10550   3rd Qu.:4003.2  
##  Max.   :23857   Max.   :8926.7

For this subset, normalisation or scaling is required, as the two variables now span different orders of magnitude (Grocery up to ~23,857 m.u. versus Detergents_Paper up to ~8,927 m.u.).

df.subset1<- as.data.frame(scale(df.subset1))
summary(df.subset1)
##     Grocery        Detergents_Paper 
##  Min.   :-1.0922   Min.   :-0.8308  
##  1st Qu.:-0.7715   1st Qu.:-0.7341  
##  Median :-0.3624   Median :-0.5459  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4786   3rd Qu.: 0.5565  
##  Max.   : 2.4606   Max.   : 2.2640

Analysis & Modeling

Tests to find the optimal number of clusters

After scaling, the following approaches are used to find the optimal number of clusters:

  1. Elbow method
  2. Average Silhouette method
  3. Gap Statistic method

For illustration purposes, all three tests are run.

set.seed(102)
# Elbow method
fviz_nbclust(df.subset1, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

# Silhouette method
fviz_nbclust(df.subset1, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

# Gap statistic
# nboot = 50 to keep the function speedy. 
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.subset1, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

Two clusters are selected, based on two out of the three tests above. Next, we visualise the clusters, marking each centroid with a triangle.

K-means Clustering

set.seed(111)
kmean2.simple <- kmeans(df.subset1,centers=2, iter.max = 25, nstart=100)
df.subset1$cluster <- factor(kmean2.simple$cluster)
summary(df.subset1)
##     Grocery        Detergents_Paper  cluster
##  Min.   :-1.0922   Min.   :-0.8308   1: 74  
##  1st Qu.:-0.7715   1st Qu.:-0.7341   2:234  
##  Median :-0.3624   Median :-0.5459          
##  Mean   : 0.0000   Mean   : 0.0000          
##  3rd Qu.: 0.4786   3rd Qu.: 0.5565          
##  Max.   : 2.4606   Max.   : 2.2640
ggplot(data=df.subset1, aes(x=Detergents_Paper, y=Grocery, colour=cluster))+geom_point()+geom_point(data=as.data.frame(kmean2.simple$centers), color ="black", size=4, shape =17)

Evaluation & Interpretation

The plot above shows the two clusters formed by the k-means algorithm on the two variables Grocery and Detergents_Paper. Next, we examine the k-means result object and validate the clustering.

Three types of clustering validation are commonly distinguished:

  1. Internal cluster validation, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can also be used for estimating the number of clusters and choosing the appropriate clustering algorithm without any external data.

  2. External cluster validation, which consists of comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels, and since the "true" grouping is known in advance, it is mainly used for selecting the right clustering algorithm for a specific dataset. Common measures:
    - Adjusted Rand index: varies from -1 (no agreement) to 1 (perfect agreement)
    - Meila's variation of information (VI)
    - Jaccard index

  3. Relative cluster validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g. varying the number of clusters k). It is generally used for determining the optimal number of clusters (see the NbClust sketch after this list).
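As a sketch of the relative-validation idea, the NbClust package loaded at the start (but not used so far) can vote across many indices for the best number of clusters; the parameter choices below (distance, min.nc, max.nc, method) are illustrative:

# NbClust computes ~30 indices and reports the k favoured by the majority
nb <- NbClust(df.subset1[, c("Grocery", "Detergents_Paper")],
              distance = "euclidean", min.nc = 2, max.nc = 8,
              method = "kmeans", index = "all")
table(nb$Best.nc["Number_clusters", ])   # votes per candidate k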

Internal Cluster Validation

D<- daisy(df.subset1)
plot(silhouette(kmean2.simple$cluster, D),col=1:2, border = NA)

The plot above shows the average silhouette width per cluster:

  1. Cluster 1: 0.77 (good, as it is close to 1)
  2. Cluster 2: 0.88 (good, as it is close to 1)

The internal cluster quality evaluation is good, as shown in the plot; the per-cluster averages can also be read off numerically, as sketched below.
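A minimal sketch to read the same silhouette information numerically, using the dissimilarity object D computed above:

sil <- silhouette(kmean2.simple$cluster, D)
summary(sil)$clus.avg.widths    # average silhouette width per cluster
mean(sil[, "sil_width"])        # overall average silhouette width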

External Cluster Validation

set.seed(111)

library("fpc")
# Compute cluster stats
species <- as.numeric(kmean2.simple$cluster)
clust_stats <- cluster.stats(d = dist(df), 
                             species, kmean2.simple$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 1
# Meila’s variation index VI
clust_stats$vi
## [1] 0

For illustration's sake, the result above shows external cluster validation using the Adjusted Rand index and Meila's variation of information (VI). Note that no genuinely external class label was used, as the true grouping is not known beforehand; the clustering is effectively compared against itself, which is why the agreement is perfect (something that would not happen in the real world).

The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement).

Corrected Rand index: 1 (closer to 1 is better); VI: 0 (lower is better).
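To illustrate what external validation looks like when an independent labelling is available, one option (purely an assumption for illustration, not part of the original analysis) is to treat the two-level Channel variable as a proxy "true" label and compare it with the k-means clusters:

# Illustrative external validation: compare the clusters to the Channel labels.
# Channel is only a proxy here, not a ground-truth segmentation.
proxy_stats <- cluster.stats(d = dist(df.subset1[, c("Grocery", "Detergents_Paper")]),
                             clustering = kmean2.simple$cluster,
                             alt.clustering = df$Channel)
proxy_stats$corrected.rand   # agreement with Channel, adjusted for chance
proxy_stats$vi               # Meila's variation of information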

set.seed(111)
kmean2.simple
## K-means clustering with 2 clusters of sizes 74, 234
## 
## Cluster means:
##      Grocery Detergents_Paper
## 1  1.5076718        1.5641070
## 2 -0.4767851       -0.4946321
## 
## Clustering vector:
## 117 164 251 397  88 391 410 287 272  27  89  76 295 165 328 212 305 420 
##   2   1   2   2   2   2   2   2   2   2   2   2   2   2   2   1   1   2 
## 161 426 393 430 273  53 112 422   6 158 359 140 198 246 202 429 336 271 
##   1   2   2   2   2   2   1   2   2   2   2   2   1   1   1   2   1   2 
## 321  44 291 427 329 259 312 220 210 398  10 188 288 405 187 406 170  95 
##   2   1   2   1   2   2   2   2   1   2   1   2   2   2   2   2   2   1 
##  28  39 122 199 253 156 347 416 174 126 245  97 179 286  32 325 377 310 
##   2   1   2   2   2   1   1   2   1   2   2   2   2   2   2   2   1   1 
## 128 123 175 326 315 142 282 380 157 256 144 440 270  72 252  43  87  51 
##   2   2   2   2   2   2   2   2   1   2   2   2   2   1   1   1   1   2 
##  84  21 224 424 356 276 360 141 278 207 223 120  92 335 213 355 403 160 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1 
## 307 383 323 241 118 363  49   5 232  34 358 206 318 159 155  55 239 143 
##   1   2   2   2   2   2   1   2   2   2   1   1   2   2   2   2   2   2 
## 415  65 337 186 374  24  11 434 285 183 171 333 298 153 205 180 395  77 
##   2   2   2   2   2   1   1   2   2   1   1   2   2   2   2   2   2   2 
## 216 134  52 219  31 354 369 162 387 131 322 294 151  22  79  60  80 250 
##   1   2   2   1   2   1   2   2   2   2   2   1   2   2   2   2   2   2 
## 125 217 244 114  18 338 197 418 304 227 230 105 102 238 346 195 319 236 
##   2   1   2   2   2   2   2   2   1   2   2   2   1   2   2   2   2   2 
## 297  50 229 130 225 365 193 184 265 138 178 375  26 268  70 145 431 300 
##   2   1   2   2   2   2   2   1   1   2   2   2   2   2   2   2   2   2 
## 260 189 313 417 367  64 353 384 132 292 281 166 222  23 366 296 311 409 
##   2   1   1   1   2   1   2   2   2   2   2   1   1   2   2   2   2   2 
## 146 389 425 407  58  37  71 111 200 110  56 394 419 181  73 386  78 433 
##   1   2   2   2   1   2   2   2   2   1   2   2   1   2   2   2   1   2 
##  81 327 343 113 352  91  62 116 211  29  82  42 350 248 247 332 376 177 
##   2   2   2   2   1   2   1   2   2   1   1   2   1   2   2   1   2   2 
## 192 203 201 349  46 435  68 388 392  57 399 411 124 432  12 341 107 228 
##   2   2   1   2   1   2   1   2   2   1   2   2   2   2   2   1   1   2 
## 308  19 240 264 279 372 261  30 208 342 371 194 400  59   8 283 231 309 
##   2   2   2   2   2   2   2   2   2   1   2   1   2   2   2   2   2   2 
## 185 330 284  20 348 255  75  99  15 274 381 234 299  14 182 237 306 428 
##   2   2   2   2   1   2   2   2   1   2   2   2   1   1   1   2   2   2 
## 301 109 
##   2   2 
## 
## Within cluster sum of squares by cluster:
## [1] 73.20304 81.10911
##  (between_SS / total_SS =  74.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
# checking betweenss, i.e. the between-cluster sum of squares
kmean2.simple$betweenss
## [1] 459.6878

Findings & Executive Summary

Summary of findings on the clusters:

  1. The clusters contain 74 and 234 observations respectively, roughly a 24% to 76% split.
  2. Cluster 1's average spending is higher than cluster 2's, based on the centroid means.
  3. Within-cluster sum of squares (intra-cluster cohesion) - Cluster 1: 73.20, Cluster 2: 81.11
  4. Goodness of the k-means classification (between_SS / total_SS): 74.9% (good fit)
  5. Between-cluster sum of squares: 459.69
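For traceability, these figures can be pulled straight from the fitted k-means object; a minimal sketch:

kmean2.simple$size                                # observations per cluster
round(prop.table(kmean2.simple$size), 2)          # share of observations per cluster
kmean2.simple$centers                             # centroid means (in scaled units)
kmean2.simple$withinss                            # within-cluster sum of squares
kmean2.simple$betweenss / kmean2.simple$totss     # between_SS / total_SS (74.9%)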

Next Step

The next project will apply k-means clustering to all product categories using the same dataset, followed by PAM clustering (without altering the outlier values). Finally, the resulting clusterings will be compared using cluster.stats().


Reference

Clustering

http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/

http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#elbow-method

https://en.proft.me/2016/06/18/exploring-k-means-clustering-analysis-r/

https://stackoverflow.com/questions/32570693/make-silhouette-plot-legible-for-k-means

https://www.guru99.com/r-k-means-clustering.html

http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/

https://www.r-bloggers.com/k-means-clustering-in-r/

http://cowlet.org/2013/12/23/understanding-data-science-clustering-with-k-means-in-r.html

https://www.edureka.co/blog/k-means-clustering-algorithm/

http://www.sthda.com/english/wiki/print.php?id=241#clustering-validation-statistics

https://www.edureka.co/blog/clustering-on-bank-data-using-r/