1. Business question
Goal:
- We will try to identify trends in customer purchases and find tendencies that can inform a marketing strategy based on which groups of items are frequently purchased together.
What we will do:
- We will use K-Means and PCA to segment customers into purchase groups.
Columns description:
- Channel: Horeca (1) / Retail (2)
- Region: branch location of the wholesaler
- Fresh: quantity of purchases of fresh goods
- Milk: quantity of purchases of milk
- Grocery: quantity of purchases of groceries
- Frozen: quantity of purchases of frozen food
- Detergents_Paper: quantity of purchases of detergents and paper
- Delicassen: quantity of purchases of delicatessen products
2. Data Preparation
Import Library
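The library chunk is not shown in the source; judging from the functions used later (glimpse, %>%, ggpairs, PCA, fviz_nbclust, fviz_cluster, number_format), the following packages are assumed:
library(dplyr)      # glimpse(), %>%, group_by(), summarise_all()
library(ggplot2)    # theme(), labs(), scale_y_continuous()
library(GGally)     # ggpairs()
library(FactoMineR) # PCA(), plot.PCA()
library(factoextra) # fviz_nbclust(), fviz_cluster()
library(scales)     # number_format()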
Read Data
wholesale <- read.csv("wholesale.csv")
head(wholesale, 5)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
This dataset contains customers' purchases of different types of products.
Take a Peek at the Data
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…
Insights:
- Since we are trying to find purchase characteristics, Channel and Region will not be used; they describe the customer type and location rather than purchase quantities.
- All the columns are of integer type.
3. Data Wrangling
3.1. Remove Channel and Region columns
wholesale_clean <- wholesale %>%
select(-c(Channel, Region))
4. EDA
4.1. Check for null values
wholesale %>%
is.na() %>%
colSums()
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
Insights: there are no null values in any column.
4.2. Check the data scale (the data types must be numeric) & scale it if necessary
summary(wholesale_clean)
## Fresh Milk Grocery Frozen
## Min. : 3 Min. : 55 Min. : 3 Min. : 25.0
## 1st Qu.: 3128 1st Qu.: 1533 1st Qu.: 2153 1st Qu.: 742.2
## Median : 8504 Median : 3627 Median : 4756 Median : 1526.0
## Mean : 12000 Mean : 5796 Mean : 7951 Mean : 3071.9
## 3rd Qu.: 16934 3rd Qu.: 7190 3rd Qu.:10656 3rd Qu.: 3554.2
## Max. :112151 Max. :73498 Max. :92780 Max. :60869.0
## Detergents_Paper Delicassen
## Min. : 3.0 Min. : 3.0
## 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 816.5 Median : 965.5
## Mean : 2881.5 Mean : 1524.9
## 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :40827.0 Max. :47943.0
Insights: the columns are on very different scales, so they need to be standardized before clustering.
wholesale_scale <- scale(wholesale_clean)
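As a quick sanity check (my addition, not in the original analysis), scale() should have produced columns with mean 0 and standard deviation 1:
# each scaled column should have mean ~0 and sd ~1
round(colMeans(wholesale_scale), 3)
round(apply(wholesale_scale, 2, sd), 3)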
4.3. Explore the pairwise correlations between all the product categories
ggpairs(wholesale_clean[,c(1:6)], showStrips = F) +
theme(axis.text = element_text(colour = "black", size = 11),
strip.background = element_rect(fill = "#d63d2d"),
strip.text = element_text(colour = "white", size = 12,
face = "bold"))
Insights:
- There is a high positive correlation between Detergents_Paper and Grocery (0.925).
- There are also fairly high positive correlations between Grocery and Milk (0.728) and between Detergents_Paper and Milk (0.662). Item groups that are highly correlated could be bundled together in new marketing strategies.
- Correlations are low between Delicassen and Detergents_Paper (0.069) and between Grocery and Fresh (-0.012).
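The coefficients quoted above are read off the ggpairs panels; the same Pearson correlations can be reproduced directly (a small check of my own):
# pairwise Pearson correlations behind the ggpairs panels
round(cor(wholesale_clean), 3)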
5. Data Preprocessing
5.1. Make PCA
Note that the PCA here is run on the full wholesale data (all 8 columns, including Channel and Region), which is why the output in section 5.3 has 8 components.
# build the PCA
wholesale_pca <- PCA(wholesale,
scale.unit = T,
graph = F,
ncp = 6)
5.2. PCA visualisation
plot.PCA(wholesale_pca,
         choix = "ind", # plot the individual observations
select = "contrib 20")
Insights: of the 20 highest-contributing observations requested, only 18 are clear outliers. We can remove them.
outlier <- c(24, 48, 57, 62, 66, 86, 87, 88, 93, 94, 104, 126, 182, 184, 212, 252, 326, 334)
wholesale_clean <- wholesale_clean[-outlier,]
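The 18 row indices above were read off the plot. As an illustration only (this is my own sketch, not necessarily how FactoMineR ranks points for select = "contrib 20"), the observations contributing most to the plotted plane can be listed from the PCA object:
# combined contribution (%) of each observation to PC1 and PC2
contrib <- rowSums(wholesale_pca$ind$contrib[, 1:2])
head(sort(contrib, decreasing = TRUE), 20)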
5.3. Check the performance
wholesale_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.10000983 38.7501229 38.75012
## comp 2 1.78996704 22.3745880 61.12471
## comp 3 1.01177388 12.6471735 73.77188
## comp 4 0.73839230 9.2299037 83.00179
## comp 5 0.55663240 6.9579050 89.95969
## comp 6 0.45930835 5.7413544 95.70105
## comp 7 0.28112605 3.5140757 99.21512
## comp 8 0.06279015 0.7848769 100.00000
Insight: we need to keep up to PC 4 to retain more than 75% of the information (the first four components capture 83.0% of the variance).
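If we wanted to work in the reduced space, the observation coordinates on those components are already stored in the PCA object; a minimal sketch (the reduced data is not used further in this analysis):
# observation coordinates on the first 4 PCs (~83% of the variance)
wholesale_reduced <- wholesale_pca$ind$coord[, 1:4]
head(wholesale_reduced)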
6. Find optimum K
Before we do clustering, we first need to determine the optimal number of clusters. In clustering, we seek to minimize the total within-cluster sum of squares (meaning the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide on the number of clusters by majority vote.
6.1. Elbow Method
Choosing the number of clusters using the elbow method is somewhat arbitrary. The rule of thumb is to choose the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
fviz_nbclust(wholesale_clean, kmeans, method = "wss", k.max = 15) +
  scale_y_continuous(labels = number_format(scale = 10^(-9),
                                            big.mark = ",", suffix = " bil.")) +
  labs(subtitle = "Elbow method")
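For reference, the quantity being plotted is just the total within-cluster sum of squares for each k; a manual sketch of the same curve (my own, results will differ slightly from fviz_nbclust depending on the random starts):
# total within-cluster sum of squares for k = 1..15
set.seed(100)
wss <- sapply(1:15, function(k) {
  kmeans(wholesale_clean, centers = k, nstart = 25)$tot.withinss
})
plot(1:15, wss, type = "b", xlab = "k", ylab = "Total WSS")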
Insight: use k = 5, since the curve flattens out (the decrease in total WSS is small) from that point onward.
6.2. Silhouette Method
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number of clusters with the highest silhouette score (the peak).
fviz_nbclust(wholesale_clean, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
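The average silhouette width that fviz_nbclust plots for each k can also be computed directly with cluster::silhouette; a sketch of my own for k = 2, assuming the cluster package is installed:
library(cluster)
set.seed(100)
km2 <- kmeans(wholesale_clean, centers = 2)
sil <- silhouette(km2$cluster, dist(wholesale_clean))
mean(sil[, "sil_width"]) # average silhouette width for k = 2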
Insight: Based on the silhouette method, the number of clusters with the maximum score is taken as the optimum. The graph shows that the optimal number of clusters is 2.
6.3. Gap Statistic
The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic.
fviz_nbclust(wholesale_clean, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
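If the standard errors behind this curve are of interest, the gap statistic can also be computed directly with cluster::clusGap (a sketch of my own; B = 50 bootstrap references keeps it quick):
library(cluster)
set.seed(100)
gap <- clusGap(wholesale_clean, FUNcluster = kmeans, K.max = 15, B = 50)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"]) # suggested k under the firstSEmax rule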
Insight: Based on the gap statistic method, the graph shows that the optimal number of clusters is 1.
Since there is no majority vote, I decided to use k = 5, because 1 or 2 clusters are too few to form segments worth analyzing.
7. Build cluster
RNGkind(sample.kind = "Rounding")
set.seed(100)
ws_kmeans <- kmeans(x = wholesale_scale, centers = 5)
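Before profiling, a quick look at the fitted object shows how large and how well separated the clusters are:
ws_kmeans$size # number of observations per cluster
ws_kmeans$betweenss / ws_kmeans$totss # share of total variance that is between clusters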
8. Cluster Profiling to understand each cluster's characteristics
# Returning the cluster labels to the original data
wholesale$cluster <- ws_kmeans$cluster
# Cluster profiling
wholesale %>%
group_by(cluster) %>%
summarise_all(mean)
## # A tibble: 5 × 9
## cluster Channel Region Fresh Milk Grocery Frozen Detergents_Paper
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.14 2.52 9023. 2978. 3808. 1939. 1006.
## 2 2 1.09 2.63 16522. 4313. 4314. 13034. 513.
## 3 3 1.15 2.66 40004. 4787. 6419. 4644. 1168.
## 4 4 1.92 2.58 18572. 35622. 43963. 6020. 21111.
## 5 5 1.91 2.52 5509. 10556. 16478. 1420. 7199.
## # … with 1 more variable: Delicassen <dbl>
Insights:
- Cluster 1 contains customers with low purchases of Delicassen and Detergents_Paper products.
- Cluster 2 contains customers with low purchases of Detergents_Paper products.
- Cluster 3 contains customers with high purchases of Fresh products and a good amount of the other categories.
- Cluster 4 contains high-profile customers who make large purchases in all categories except Frozen.
- Cluster 5 contains customers with high purchases of Milk and Grocery.
9. Cluster Visualisation
fviz_cluster(object = ws_kmeans, data = wholesale_scale, labelsize = 0) + theme_minimal()
10. Conclusions
We can draw some conclusions about our dataset from the preceding cluster and principal component analysis:
- We can separate our customers into at least 5 different segments.
- Cluster 4 contains customers with high SES, indicated by large purchases across all categories in a single transaction.
- There is no need to perform dimensionality reduction on this dataset, since all the data and columns are needed.
- The clusters are well separated from each other, except for cluster 3.