Source Dataset: https://www.kaggle.com/binovi/wholesale-customers-data-set
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(FactoMineR)
library(factoextra)
library(scales)
library(GGally)
wholesale <- read.csv("wholesale.csv")
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Description:
FRESH: annual spending (m.u.) on fresh products (Continuous).
MILK: annual spending (m.u.) on milk products (Continuous).
GROCERY: annual spending (m.u.) on grocery products (Continuous).
FROZEN: annual spending (m.u.) on frozen products (Continuous).
DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous).
DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous).
CHANNEL: 1 = Horeca (Hotel/Restaurant/Cafe) or 2 = Retail channel (Nominal).
REGION: 1 = Lisbon / 2 = Porto / 3 = other region (Nominal).
Check the first six rows
head(wholesale)
Check Missing Values
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
There are no missing values in our dataset.
Change data types
wholesale <- wholesale %>%
mutate(Channel = as.factor(Channel),
Region = as.factor(Region))
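If we want more readable labels, the codes can optionally be mapped to the names from the description above. This is just a convenience sketch; wholesale_labeled is a new object and is not used in the rest of the analysis.
# optional: map the coded factor levels to descriptive labels
wholesale_labeled <- wholesale %>%
  mutate(Channel = factor(Channel, levels = c("1", "2"), labels = c("Horeca", "Retail")),
         Region  = factor(Region,  levels = c("1", "2", "3"), labels = c("Lisbon", "Porto", "Other")))
head(wholesale_labeled)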
Check Summary
summary(wholesale)
## Channel Region Fresh Milk Grocery
## 1:298 1: 77 Min. : 3 Min. : 55 Min. : 3
## 2:142 2: 47 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153
## 3:316 Median : 8504 Median : 3627 Median : 4756
## Mean : 12000 Mean : 5796 Mean : 7951
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656
## Max. :112151 Max. :73498 Max. :92780
## Frozen Detergents_Paper Delicassen
## Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :60869.0 Max. :40827.0 Max. :47943.0
Data Visualization
ggplot(gather(wholesale %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10,fill="firebrick") +
facet_wrap(~key, scales = 'free_x',nrow=3) +
theme_bw()
Based on the summary and the visualization above, the maximum value of each column is far higher than its mean and median, so the data contain outliers. We will identify them later using PCA.
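As a quick numerical check of that claim, we can count how many observations exceed Q3 + 1.5*IQR in each spending column. This is a minimal sketch; the 1.5*IQR rule is a common outlier heuristic and is not part of the original analysis.
# count observations above Q3 + 1.5*IQR in each numeric column (common outlier heuristic)
wholesale %>%
  select_if(is.numeric) %>%
  summarise_all(~ sum(. > quantile(., 0.75) + 1.5 * IQR(.)))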
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
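Since PCA is sensitive to scaling, it is worth a quick cross-check with base R before using FactoMineR. This is a minimal sketch using prcomp() with scale. = TRUE; its squared standard deviations should closely match the eigenvalues reported by PCA() below.
# base-R cross-check: eigenvalues of the correlation matrix via prcomp()
pca_check <- prcomp(wholesale %>% select_if(is.numeric), scale. = TRUE)
pca_check$sdev^2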
quanti <- wholesale %>%
select_if(is.numeric) %>%
colnames()
quantivar <- which(colnames(wholesale) %in% quanti)
quali <- wholesale %>%
select_if(is.factor) %>%
colnames()
qualivar <- which(colnames(wholesale) %in% quali)
Run PCA with the PCA() function from the FactoMineR package
wholesale_pca <- PCA(wholesale,
scale.unit = T, #scaling
quali.sup = qualivar, #data categorical
graph = F, # not showing graph
ncp = 5)
wholesale_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.64497357 44.082893 44.08289
## comp 2 1.70258397 28.376400 72.45929
## comp 3 0.74006477 12.334413 84.79371
## comp 4 0.56373023 9.395504 94.18921
## comp 5 0.28567634 4.761272 98.95048
## comp 6 0.06297111 1.049519 100.00000
Let’s visualize the percentage of variance captured by each dimension.
fviz_eig(wholesale_pca, ncp = 6,
         addlabels = T, main = "Variance explained by each dimension")
Check Summary
summary(wholesale_pca)
##
## Call:
## PCA(X = wholesale, scale.unit = T, ncp = 5, quali.sup = qualivar,
## graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.645 1.703 0.740 0.564 0.286 0.063
## % of var. 44.083 28.376 12.334 9.396 4.761 1.050
## Cumulative % of var. 44.083 72.459 84.794 94.189 98.950 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.795 | 0.193 0.003 0.059 | -0.305 0.012 0.147 |
## 2 | 0.753 | 0.434 0.016 0.333 | -0.328 0.014 0.190 |
## 3 | 2.332 | 0.811 0.057 0.121 | 0.815 0.089 0.122 |
## 4 | 1.133 | -0.779 0.052 0.472 | 0.653 0.057 0.332 |
## 5 | 1.577 | 0.166 0.002 0.011 | 1.271 0.216 0.650 |
## 6 | 0.736 | -0.156 0.002 0.045 | -0.295 0.012 0.161 |
## 7 | 0.738 | -0.335 0.010 0.206 | -0.525 0.037 0.506 |
## 8 | 0.623 | 0.141 0.002 0.051 | -0.231 0.007 0.137 |
## 9 | 0.884 | -0.517 0.023 0.343 | -0.659 0.058 0.557 |
## 10 | 1.782 | 1.592 0.218 0.799 | -0.741 0.073 0.173 |
## Dim.3 ctr cos2
## 1 0.141 0.006 0.031 |
## 2 -0.319 0.031 0.179 |
## 3 -1.523 0.713 0.427 |
## 4 -0.163 0.008 0.021 |
## 5 -0.066 0.001 0.002 |
## 6 -0.148 0.007 0.040 |
## 7 0.303 0.028 0.169 |
## 8 -0.390 0.047 0.392 |
## 9 -0.183 0.010 0.043 |
## 10 -0.210 0.014 0.014 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Fresh | 0.070 0.184 0.005 | 0.689 27.871 0.475 | 0.699 65.976
## Milk | 0.887 29.715 0.786 | 0.109 0.692 0.012 | -0.052 0.365
## Grocery | 0.942 33.554 0.887 | -0.191 2.134 0.036 | 0.093 1.175
## Frozen | 0.083 0.262 0.007 | 0.798 37.366 0.636 | -0.153 3.182
## Detergents_Paper | 0.892 30.101 0.796 | -0.333 6.514 0.111 | 0.117 1.855
## Delicassen | 0.404 6.184 0.164 | 0.658 25.422 0.433 | -0.451 27.448
## cos2
## Fresh 0.488 |
## Milk 0.003 |
## Grocery 0.009 |
## Frozen 0.024 |
## Detergents_Paper 0.014 |
## Delicassen 0.203 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## Channel_1 | 0.711 | -0.655 0.850 -12.227 | 0.274 0.149 6.383
## Channel_2 | 1.491 | 1.375 0.850 12.227 | -0.576 0.149 -6.383
## Region_1 | 0.128 | -0.102 0.632 -0.604 | -0.060 0.218 -0.442
## Region_2 | 0.375 | 0.089 0.056 0.395 | -0.102 0.073 -0.564
## Region_3 | 0.067 | 0.012 0.030 0.238 | 0.030 0.198 0.760
## Dim.3 cos2 v.test
## Channel_1 | 0.004 0.000 0.144 |
## Channel_2 | -0.009 0.000 -0.144 |
## Region_1 | -0.034 0.070 -0.380 |
## Region_2 | -0.060 0.026 -0.509 |
## Region_3 | 0.017 0.067 0.670 |
Through PCA, I can retain a few informative PCs (those with high cumulative variance) to perform dimensionality reduction, shrinking the number of variables while retaining as much information as possible.
Based on the plot and summary above, I would like to retain at least 80% of the variance, so I am going to keep Dim 1-3 (84.79% cumulative), as extracted in the sketch below.
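The individuals' coordinates on those dimensions can be pulled straight from the PCA object. A minimal sketch; wholesale_pc3 is a new object name used only here.
# keep the individuals' coordinates on the first three dimensions (~84.8% of the variance)
wholesale_pc3 <- as.data.frame(wholesale_pca$ind$coord[, 1:3])
head(wholesale_pc3)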
We can spot the outliers using the individual plot below.
plot.PCA(wholesale_pca,
choix = "ind",
invisible = "quali",
select = "contrib 5",
habillage = 1)
Through the individual plot of PCA, Dim 1 alone covers 44.08% of the variance of the data.
We also find the 5 most extreme individuals (colored by Channel); they can be recovered programmatically as well, see the sketch after this list:
2 from Hotel/Restaurant/Cafe: 184, 326.
3 from Retail: 48, 62, 86.
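The same observations can be recovered from the contributions stored in the PCA object. A minimal sketch; ranking by the summed contribution to Dim 1 and Dim 2 approximates what select = "contrib 5" highlights on the plotted plane.
# rank individuals by their contribution to the first two dimensions
contrib12 <- rowSums(wholesale_pca$ind$contrib[, 1:2])
head(sort(contrib12, decreasing = TRUE), 5)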
To represent more than two components, the variables are positioned inside the correlation circle. A variable close to the circle (near the edge) is well reconstructed from the first two components; a variable close to the center of the plot is less important for those two components.
plot.PCA(wholesale_pca,
choix = "var")
Insight:
PC1 is mostly driven by two variables: Grocery and Detergents_Paper.
PC2 is mostly driven by two variables: Frozen and Fresh.
fviz_contrib(X = wholesale_pca,
choice = "var",
axes = 1)
Grocery, Detergents_Paper, and Milk are the variables that contribute most to dimension 1.
fviz_contrib(X = wholesale_pca,
choice = "var",
axes = 2)
Frozen, Fresh, and Delicassen are the variables that contribute most to dimension 2.
Choose the numeric columns
wholesale_num <- wholesale %>% select_if(is.numeric)
Scale dataset
wholesale_num_scale <- scale(wholesale_num)
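A quick sanity check (a minimal sketch using base R only): after scaling, every column should have mean roughly 0 and standard deviation 1.
# each scaled column should have mean ~0 and sd ~1
round(colMeans(wholesale_num_scale), 3)
apply(wholesale_num_scale, 2, sd)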
Finding the optimal K using the Elbow Method
Choosing the number of clusters with the elbow method is somewhat subjective. The rule of thumb is to pick the number of clusters at the "bend of the elbow", where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
fviz_nbclust(wholesale_num_scale, kmeans, method = "wss", k.max = 30)+ labs(subtitle = "Elbow method") + theme_bw()
Using the elbow method, 5 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. The elbow alone may not be conclusive, though, because the bend is vague; a silhouette check is sketched below.
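As a second opinion, the average silhouette width can be computed with the same helper. This is a minimal sketch; it may point to a different K than the elbow plot, in which case judgment (and the business context) decides.
# alternative check: average silhouette width per number of clusters
fviz_nbclust(wholesale_num_scale, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette method") +
  theme_bw()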
K-Means Clustering
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# k-means clustering
wholesale_kmeans <- kmeans(wholesale_num_scale, centers = 5)
fviz_cluster(wholesale_kmeans, data=wholesale_num_scale)
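A few quick diagnostics on the fitted object (a minimal sketch; size, betweenss, and totss are standard fields returned by kmeans()):
# cluster sizes and the share of variance captured by the clustering
wholesale_kmeans$size
wholesale_kmeans$betweenss / wholesale_kmeans$totss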
Cluster Profiling
# attach the cluster labels to the dataset
wholesale$cluster <- wholesale_kmeans$cluster
wholesale1 <- wholesale[,3:9]
# profile the clusters by their mean spending
wholesale1 %>%
group_by(cluster) %>%
summarise_all(mean)
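To make the profiles easier to read, we can also ask, for each spending column, which cluster has the highest and the lowest mean. A minimal sketch; cluster_means is a new object, and because group_by() sorts the clusters 1-5, the row index returned by which.max()/which.min() is the cluster number.
# for each spending column, find the cluster with the highest / lowest average spend
cluster_means <- wholesale1 %>%
  group_by(cluster) %>%
  summarise_all(mean)
sapply(cluster_means[-1], which.max)  # cluster with the highest mean per column
sapply(cluster_means[-1], which.min)  # cluster with the lowest mean per column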
Based on the result above, below are the profiles of each cluster:
Cluster 1: highest at Fresh and lowest at Delicassen.
Cluster 2: highest at Fresh and Frozen but lowest at Detergents_Paper.
Cluster 3: highest at Fresh and lowest at Detergents_Paper.
Cluster 4: highest at Grocery and lowest at Frozen.
Cluster 5: highest at Grocery and lowest at Delicassen.
After exploring our dataset using PCA and K-Means clustering, we are able to conclude that: