1. Business question
Goal:
- We will try to identify trends in customer purchases and find tendencies that can inform a marketing strategy based on which groups of items are frequently purchased together.
What we will do:
- We will use K-Means and PCA to segment customers into purchase groups.
Columns description:
- Channel: Horeca (1) / Retail (2)
- Region: branch location of the wholesaler
- Fresh: quantity of purchases of fresh goods
- Milk: quantity of purchases of milk
- Grocery: quantity of purchases of groceries
- Frozen: quantity of purchases of frozen food
- Detergents_Paper: quantity of purchases of detergents and paper
- Delicassen: quantity of purchases of delicatessen products
2. Data Preparation
Import Library
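The library chunk is not shown in the source; judging from the functions used later (glimpse, %>%, ggpairs, PCA, fviz_nbclust, fviz_cluster, number_format), the following packages are assumed:
library(dplyr)      # glimpse(), %>%, group_by(), summarise_all()
library(ggplot2)    # theme(), labs(), scale_y_continuous()
library(GGally)     # ggpairs()
library(FactoMineR) # PCA(), plot.PCA()
library(factoextra) # fviz_nbclust(), fviz_cluster()
library(scales)     # number_format()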
Read Data
wholesale <- read.csv("wholesale.csv")
head(wholesale, 5)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
This dataset contains customers' purchases of different types of products.
Take a Peek at the Data
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Insights:
- Since we are trying to find purchase characteristics, Channel and Region will not be used; they describe the customer type and location rather than purchase quantities.
- All the columns are of integer type.
3. Data Wrangling
3.1. Remove Channel and Region columns
wholesale_clean <- wholesale %>%
select(-c(Channel, Region))
4. EDA
4.1. Check for null values
wholesale %>%
is.na() %>%
colSums()
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Insights: there are no null values in any column.
4.2. Check the data scale (the data types must be numeric) & scale it if necessary
summary(wholesale_clean)
## Fresh Milk Grocery Frozen
## Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
Insights: the columns are on very different scales, so they need to be standardized before clustering.
wholesale_scale <- scale(wholesale_clean)
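As a quick sanity check (my addition, not in the original analysis), scale() should have produced columns with mean 0 and standard deviation 1:
# each scaled column should have mean ~0 and sd ~1
round(colMeans(wholesale_scale), 3)
round(apply(wholesale_scale, 2, sd), 3)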
4.3. Explore the pairwise correlations between all the product categories
ggpairs(wholesale_clean[,c(1:6)], showStrips = F) +
theme(axis.text = element_text(colour = "black", size = 11),
strip.background = element_rect(fill = "#d63d2d"),
strip.text = element_text(colour = "white", size = 12,
face = "bold"))
Insights:
- There is a high positive correlation between Detergents_Paper and Grocery (0.925).
- There are also fairly high positive correlations between Grocery and Milk (0.728) and between Detergents_Paper and Milk (0.662). Item groups that are highly correlated could be bundled together in new marketing strategies.
- Correlations are low between Delicassen and Detergents_Paper (0.069) and between Grocery and Fresh (-0.012).
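The coefficients quoted above are read off the ggpairs panels; the same Pearson correlations can be reproduced directly (a small check of my own):
# pairwise Pearson correlations behind the ggpairs panels
round(cor(wholesale_clean), 3)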
5. Data Preprocessing
5.1. Make PCA
Note that the PCA here is run on the full wholesale data (all 8 columns, including Channel and Region), which is why the output in section 5.3 has 8 components.
# build the PCA
wholesale_pca <- PCA(wholesale,
scale.unit = T,
graph = F,
ncp = 6)
5.2. PCA visualisation
plot.PCA(wholesale_pca,
         choix = "ind", # plot the individual observations
select = "contrib 20")
Insights: of the 20 highest-contributing observations requested, only 18 are clear outliers. We can remove them.
outlier <- c(24, 48, 57, 62, 66, 86, 87, 88, 93, 94, 104, 126, 182, 184, 212, 252, 326, 334)
wholesale_clean <- wholesale_clean[-outlier,]
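The 18 row indices above were read off the plot. As an illustration only (this is my own sketch, not necessarily how FactoMineR ranks points for select = "contrib 20"), the observations contributing most to the plotted plane can be listed from the PCA object:
# combined contribution (%) of each observation to PC1 and PC2
contrib <- rowSums(wholesale_pca$ind$contrib[, 1:2])
head(sort(contrib, decreasing = TRUE), 20)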
5.3. Check the performance
wholesale_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.10000983 38.7501229 38.75012
## comp 2 1.78996704 22.3745880 61.12471
## comp 3 1.01177388 12.6471735 73.77188
## comp 4 0.73839230 9.2299037 83.00179
## comp 5 0.55663240 6.9579050 89.95969
## comp 6 0.45930835 5.7413544 95.70105
## comp 7 0.28112605 3.5140757 99.21512
## comp 8 0.06279015 0.7848769 100.00000
Insight: we need to keep up to PC 4 to retain more than 75% of the information (the first four components capture 83.0% of the variance).
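If we wanted to work in the reduced space, the observation coordinates on those components are already stored in the PCA object; a minimal sketch (the reduced data is not used further in this analysis):
# observation coordinates on the first 4 PCs (~83% of the variance)
wholesale_reduced <- wholesale_pca$ind$coord[, 1:4]
head(wholesale_reduced)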
6. Find optimum K
Before we do clustering, we first need to determine the optimal number of clusters. In clustering, we seek to minimize the total within-cluster sum of squares (meaning the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide on the number of clusters by majority vote.
6.1. Elbow Method
Choosing the number of clusters using the elbow method is somewhat arbitrary. The rule of thumb is to choose the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
fviz_nbclust(wholesale_clean, kmeans, method = "wss", k.max = 15) +
  scale_y_continuous(labels = number_format(scale = 10^(-9),
                                            big.mark = ",", suffix = " bil.")) +
  labs(subtitle = "Elbow method")
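For reference, the quantity being plotted is just the total within-cluster sum of squares for each k; a manual sketch of the same curve (my own, results will differ slightly from fviz_nbclust depending on the random starts):
# total within-cluster sum of squares for k = 1..15
set.seed(100)
wss <- sapply(1:15, function(k) {
  kmeans(wholesale_clean, centers = k, nstart = 25)$tot.withinss
})
plot(1:15, wss, type = "b", xlab = "k", ylab = "Total WSS")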
Insight: use k = 5, since the curve flattens out (the decrease in total WSS is small) from that point onward.
6.2. Silhouette Method
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number of clusters with the highest silhouette score (the peak).
fviz_nbclust(wholesale_clean, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
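The average silhouette width that fviz_nbclust plots for each k can also be computed directly with cluster::silhouette; a sketch of my own for k = 2, assuming the cluster package is installed:
library(cluster)
set.seed(100)
km2 <- kmeans(wholesale_clean, centers = 2)
sil <- silhouette(km2$cluster, dist(wholesale_clean))
mean(sil[, "sil_width"]) # average silhouette width for k = 2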
Insight: Based on the silhouette method, the number of clusters with the maximum score is taken as the optimum. The graph shows that the optimal number of clusters is 2.
6.3. Gap Statistic
The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.
fviz_nbclust(wholesale_clean, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
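If the standard errors behind this curve are of interest, the gap statistic can also be computed directly with cluster::clusGap (a sketch of my own; B = 50 bootstrap references keeps it quick):
library(cluster)
set.seed(100)
gap <- clusGap(wholesale_clean, FUNcluster = kmeans, K.max = 15, B = 50)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"]) # suggested k under the firstSEmax rule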
Insight: Based on the gap statistic method, the graph shows that the optimal number of clusters is 1.
Since there is no majority vote, I decided to use k = 5, because 1 or 2 clusters are too few to form segments worth analyzing.
7. Build cluster
RNGkind(sample.kind = "Rounding")
set.seed(100)
ws_kmeans <- kmeans(x = wholesale_scale, centers = 5)
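Before profiling, a quick look at the fitted object shows how large and how well separated the clusters are:
ws_kmeans$size # number of observations per cluster
ws_kmeans$betweenss / ws_kmeans$totss # share of total variance that is between clusters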
8. Cluster Profiling to understand each cluster's characteristics
# Returning the cluster labels to the original data
wholesale$cluster <- ws_kmeans$cluster
# Cluster profiling
wholesale %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 5 × 9
## cluster Channel Region Fresh Milk Grocery Frozen Detergents_Paper
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.14 2.52 9023. 2978. 3808. 1939. 1006.
## 2 2 1.09 2.63 16522. 4313. 4314. 13034. 513.
## 3 3 1.15 2.66 40004. 4787. 6419. 4644. 1168.
## 4 4 1.92 2.58 18572. 35622. 43963. 6020. 21111.
## 5 5 1.91 2.52 5509. 10556. 16478. 1420. 7199.
## # … with 1 more variable: Delicassen <dbl>
Insights:
- Cluster 1 contains customers with low purchases of Delicassen and Detergents_Paper products.
- Cluster 2 contains customers with low purchases of Detergents_Paper products.
- Cluster 3 contains customers with high purchases of Fresh products and a good amount of the other categories.
- Cluster 4 contains high-profile customers who make large purchases in all categories except Frozen.
- Cluster 5 contains customers with high purchases of Milk and Grocery.
9. Cluster Visualisation
fviz_cluster(object = ws_kmeans, data = wholesale_scale, labelsize = 0) + theme_minimal()
10. Conclusions
We can draw some conclusions about our dataset from the preceding cluster and principal component analysis:
- We can separate our customers into at least 5 different segments.
- Cluster 4 contains customers with high SES, indicated by large purchases across all categories in a single transaction.
- There is no need to perform dimensionality reduction on this dataset, since all the data and columns are needed.
- The clusters are well separated from each other, except for cluster 3.