Picture taken from Kaggle.
Hello everyone, and welcome to my Rmd. This is my HTML document about clustering the Wholesale Customers Data Set. I hope you enjoy it!
We will learn to use K-Means clustering and PCA on the Wholesale dataset, and we want to know how many clusters can be formed.
Data Source: https://www.kaggle.com/datasets/binovi/wholesale-customers-data-set
The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
Source: UCI Wholesale customers Data Set
We will try to do a clustering analysis using the K-means method. We will also see if we can reduce the dimensionality using Principal Component Analysis (PCA).
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(tidyverse)
library(lubridate)
library(cluster)
library(ggforce)
library(scales)
library(cowplot)
The dataset contains 8 columns and 440 rows.
wholesale <- read.csv("data_input/wholesale.csv")
str(wholesale)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Of the 8 columns in the dataset, we need to convert Channel and Region to factors.
wholesale <- wholesale %>%
mutate(Channel = as.factor(Channel),
Region = as.factor(Region))
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
## $ Region <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Next, we check for missing values in the dataset.
anyNA(wholesale)
## [1] FALSE
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
There are no missing values in the dataset, so we can continue to the next steps.
One of the easiest ways to get an initial look at possible clusters is to use Channel. Here we plot the Grocery and Fresh columns.
ggplot(wholesale, aes(Grocery, Fresh, color = Channel)) +
geom_point(alpha = 0.5) + theme_minimal()
There are only slight differences between Channel 1 and Channel 2 when looking at Grocery and Fresh.
Now we want to check some stats between Channel 1 and 2.
p1 <- ggplot(wholesale, aes(Channel, Fresh, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Fresh")
p2 <- ggplot(wholesale, aes(Channel, Milk, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Milk")
p3 <- ggplot(wholesale, aes(Channel, Grocery, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Grocery")
p4 <- ggplot(wholesale, aes(Channel, Frozen, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Frozen")
p5 <- ggplot(wholesale, aes(Channel, Detergents_Paper, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Detergents_Paper")
p6 <- ggplot(wholesale, aes(Channel, Delicassen, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Delicassen")
plot_grid(p1, p2, p3, p4, p5, p6)
Channel 2 performs better in most aspects except Fresh and Frozen.
Based on our exploratory process, we could segment or cluster the customers based on specific traits such as their Channel or Region. To find more interesting and undiscovered patterns in the data, we will use the K-means clustering method.
We want to see if there is a high correlation between the numeric variables. Strong correlation among some variables implies that we can reduce the dimensionality, or number of features, using Principal Component Analysis (PCA).
df <- wholesale %>% select(-c(Channel, Region))
ggcorr(df, low = "navy", high = "darkred")
There are some features with high correlation, such as Grocery with Detergents_Paper and Milk with Grocery. Based on this result, we will try to reduce the dimensionality using PCA.
Before we do the cluster analysis, we first need to determine the optimal number of clusters. In clustering we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters based on majority voting.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to choose the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
#elbow method
fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + scale_y_continuous(labels = number_format(scale = 10^(-9),
big.mark = ",", suffix = " bil.")) + labs(subtitle = "Elbow method")
Using the elbow method, 5 clusters look good enough, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. This method alone may not be enough, because the optimal number of clusters is still vague.
The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number with the highest average silhouette score (the peak).
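For each observation i, with a(i) the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster, the silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)); fviz_nbclust() averages this score over all observations for each candidate k.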
#Silhouette
fviz_nbclust(df, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
Based on the silhouette method, the number of clusters with the maximum average score is considered the optimal k. The graph shows that the optimal number of clusters is 2.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
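In short, Gap(k) = E*[log(W_k)] - log(W_k), where W_k is the total within-cluster sum of squares for k clusters and E* denotes the expectation under the reference (uniform) distribution, estimated by Monte Carlo sampling.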
#GAP statistic
fviz_nbclust(df, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
Every method suggests a different number of clusters. We could also try the candidate numbers one by one and compare the resulting clusters, but in this case we will use k = 5.
Here is the algorithm behind K-means clustering:
Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments for the observations.
Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid, which is the vector of the p feature means for the observations in that cluster, then assign each observation to the cluster whose centroid is closest (using Euclidean distance or any other distance measure).
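To make these two steps concrete, here is a minimal sketch of a manual K-means pass on df (for illustration only; it assumes no cluster becomes empty during the iterations, and the actual clustering below uses kmeans()):
# illustrative sketch: manual K-means steps on the numeric features in df
set.seed(123)
k <- 5
assignment <- sample(1:k, nrow(df), replace = TRUE)   # step 1: random initial assignment
for (iter in 1:10) {                                  # step 2: iterate (fixed number of passes here)
  # compute each cluster centroid: the vector of feature means per cluster
  centroids <- sapply(split(df, assignment), colMeans)
  # squared Euclidean distance of every observation to every centroid
  dist_to_centroids <- sapply(1:k, function(j) colSums((t(as.matrix(df)) - centroids[, j])^2))
  # reassign each observation to its closest centroid
  assignment <- max.col(-dist_to_centroids)
}
table(assignment)   # resulting cluster sizes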
set.seed(123)
km <- kmeans(df, centers = 5)
km
## K-means clustering with 5 clusters of sizes 79, 102, 10, 226, 23
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 4667.582 11639.101 18289.785 1525.241 8060.3165 1594.4557
## 2 21337.549 3933.853 5176.069 4182.784 1138.1078 1713.0294
## 3 21263.700 37443.300 46710.600 6287.200 21699.4000 8743.3000
## 4 6143.872 3276.792 4115.164 2438.350 1220.7035 992.5398
## 5 49296.087 4983.783 5590.304 8285.783 962.2609 2543.6957
##
## Clustering vector:
## [1] 4 4 4 4 2 4 4 4 4 1 1 4 2 2 2 4 1 4 2 4 2 4 2 3 2 2 4 4 1 5 2 4 2 2 4 4 2
## [38] 1 1 5 2 2 1 1 4 1 1 3 4 1 4 4 5 1 2 4 1 1 4 4 4 3 4 1 4 3 4 2 4 4 2 2 4 2
## [75] 4 2 4 1 4 4 4 1 4 2 4 3 3 5 4 2 4 4 3 2 1 4 4 4 4 4 1 1 4 5 4 2 1 1 4 1 4
## [112] 1 2 2 2 4 4 4 2 4 2 4 4 4 5 5 2 2 4 5 4 4 2 4 4 4 4 4 4 4 2 2 5 4 2 1 4 4
## [149] 4 2 2 4 2 4 4 1 1 2 4 1 4 4 2 1 4 1 4 4 4 4 1 1 4 1 4 1 5 4 4 4 4 5 1 3 4
## [186] 4 4 4 1 1 2 4 4 1 4 2 2 4 4 4 1 1 2 4 4 1 4 4 4 1 2 3 4 4 1 1 1 2 1 4 2 4
## [223] 4 4 4 4 2 4 4 4 4 4 2 4 2 4 4 2 4 5 2 2 2 4 4 1 4 4 2 4 4 1 4 2 4 2 4 4 5
## [260] 5 4 4 2 4 1 1 1 2 1 2 4 4 4 5 4 4 2 4 4 2 4 4 5 2 5 5 4 2 2 5 4 4 4 1 2 4
## [297] 2 4 4 4 2 1 4 1 1 1 1 2 4 1 4 2 1 4 4 1 4 4 4 1 4 4 2 4 2 5 4 4 2 4 4 1 2
## [334] 3 2 2 4 4 4 4 4 4 4 1 4 4 1 2 4 1 4 1 4 1 2 4 2 1 4 4 2 4 4 4 4 4 4 4 2 4
## [371] 5 2 4 2 4 4 1 5 4 4 2 2 2 4 1 4 4 2 4 4 4 4 4 2 4 4 1 4 4 4 4 2 2 2 2 4 2
## [408] 1 4 4 4 4 4 4 4 4 1 4 1 4 1 2 2 2 2 4 1 2 4 4 1 4 2 4 2 2 5 1 4 4
##
## Within cluster sum of squares by cluster:
## [1] 8724419459 8375563097 14108802241 10318245206 11679101316
## (between_SS / total_SS = 66.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
df_clust <- wholesale %>%
bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:8)
We will analyze the characteristics of each cluster and see whether there are differences or specific traits in each one. Since we have quite a few features (8), we might not be able to explore all of them.
df_clust %>% ggplot(aes(Grocery, Detergents_Paper, color = cluster)) +
  geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") +
  theme_minimal() + theme(legend.position = "top")
## Warning: The concaveman package is required for geom_mark_hull
From the plot we can see how the data is grouped into 5 clusters. Although the clusters are close to each other, we can still see the differences between them.
We can also check each cluster's centroid:
df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster,
Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) %>% mutate_if(is.numeric,
.funs = "round", digits = 2)
## # A tibble: 5 x 7
## cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4668. 11639. 18290. 1525. 8060. 1594.
## 2 2 21338. 3934. 5176. 4183. 1138. 1713.
## 3 3 21264. 37443. 46711. 6287. 21699. 8743.
## 4 4 6144. 3277. 4115. 2438. 1221. 993.
## 5 5 49296. 4984. 5590. 8286. 962. 2544.
From the table above, we can see the average of each variable for clusters 1 to 5 (a quick check of these readings is sketched after the list):
The highest average Fresh spending is in cluster 5 and the lowest is in cluster 1.
The highest average Milk spending is in cluster 3 and the lowest is in cluster 4.
The highest average Grocery spending is in cluster 3 and the lowest is in cluster 4.
The highest average Frozen spending is in cluster 5 and the lowest is in cluster 1.
The highest average Detergents_Paper spending is in cluster 3 and the lowest is in cluster 5.
The highest average Delicassen spending is in cluster 3 and the lowest is in cluster 4.
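As a quick cross-check of the statements above (an illustrative sketch on the same df_clust data, not part of the original workflow), we can ask, for each variable, which cluster has the highest and lowest mean:
df_clust %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), mean)) %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "mean_spend") %>%
  group_by(variable) %>%
  summarise(highest = cluster[which.max(mean_spend)],
            lowest  = cluster[which.min(mean_spend)])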
**How is each kind of Channel distributed across the clusters?**
df_clust %>% group_by(cluster, Channel) %>% summarise(total = n())
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 10 x 3
## # Groups: cluster [5]
## cluster Channel total
## <fct> <fct> <int>
## 1 1 1 7
## 2 1 2 72
## 3 2 1 81
## 4 2 2 21
## 5 3 1 1
## 6 3 2 9
## 7 4 1 188
## 8 4 2 38
## 9 5 1 21
## 10 5 2 2
Cluster 4 has the greatest number of Channel 1 customers and cluster 3 has the fewest. Cluster 1 has the greatest number of Channel 2 customers and cluster 5 has the fewest.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
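As a quick illustration of this idea (a base-R sketch, not part of the original workflow; the analysis below uses FactoMineR::PCA()), prcomp() performs the same orthogonal transformation on the scaled numeric features:
# base-R cross-check: PCA on the scaled numeric features
pca_base <- prcomp(df, scale. = TRUE)
summary(pca_base)     # proportion of variance captured by each principal component
pca_base$sdev^2       # eigenvalues: variance retained by each component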
Here we will run PCA on the wholesale1 dataset. We will look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
wholesale1 <- wholesale
wholesale1$cluster <- km$cluster
wholesale_pca <- PCA(wholesale1 %>% select(-c(Channel,Region)), scale.unit = T, ncp = 31, graph = F,
quali.sup = 7)
summary(wholesale_pca)
##
## Call:
## PCA(X = wholesale1 %>% select(-c(Channel, Region)), scale.unit = T,
## ncp = 31, quali.sup = 7, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.645 1.703 0.740 0.564 0.286 0.063
## % of var. 44.083 28.376 12.334 9.396 4.761 1.050
## Cumulative % of var. 44.083 72.459 84.794 94.189 98.950 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.795 | 0.193 0.003 0.059 | -0.305 0.012 0.147 |
## 2 | 0.753 | 0.434 0.016 0.333 | -0.328 0.014 0.190 |
## 3 | 2.332 | 0.811 0.057 0.121 | 0.815 0.089 0.122 |
## 4 | 1.133 | -0.779 0.052 0.472 | 0.653 0.057 0.332 |
## 5 | 1.577 | 0.166 0.002 0.011 | 1.271 0.216 0.650 |
## 6 | 0.736 | -0.156 0.002 0.045 | -0.295 0.012 0.161 |
## 7 | 0.738 | -0.335 0.010 0.206 | -0.525 0.037 0.506 |
## 8 | 0.623 | 0.141 0.002 0.051 | -0.231 0.007 0.137 |
## 9 | 0.884 | -0.517 0.023 0.343 | -0.659 0.058 0.557 |
## 10 | 1.782 | 1.592 0.218 0.799 | -0.741 0.073 0.173 |
## Dim.3 ctr cos2
## 1 0.141 0.006 0.031 |
## 2 -0.319 0.031 0.179 |
## 3 -1.523 0.713 0.427 |
## 4 -0.163 0.008 0.021 |
## 5 -0.066 0.001 0.002 |
## 6 -0.148 0.007 0.040 |
## 7 0.303 0.028 0.169 |
## 8 -0.390 0.047 0.392 |
## 9 -0.183 0.010 0.043 |
## 10 -0.210 0.014 0.014 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Fresh | 0.070 0.184 0.005 | 0.689 27.871 0.475 | 0.699 65.976
## Milk | 0.887 29.715 0.786 | 0.109 0.692 0.012 | -0.052 0.365
## Grocery | 0.942 33.554 0.887 | -0.191 2.134 0.036 | 0.093 1.175
## Frozen | 0.083 0.262 0.007 | 0.798 37.366 0.636 | -0.153 3.182
## Detergents_Paper | 0.892 30.101 0.796 | -0.333 6.514 0.111 | 0.117 1.855
## Delicassen | 0.404 6.184 0.164 | 0.658 25.422 0.433 | -0.451 27.448
## cos2
## Fresh 0.488 |
## Milk 0.003 |
## Grocery 0.009 |
## Frozen 0.024 |
## Detergents_Paper 0.014 |
## Delicassen 0.203 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## cluster.1 | 1.854 | 1.624 0.768 9.790 | -0.860 0.215 -6.458 |
## cluster.2 | 0.942 | -0.448 0.226 -3.170 | 0.679 0.520 5.990 |
## cluster.3 | 7.633 | 7.576 0.985 14.884 | 0.836 0.012 2.048 |
## cluster.4 | 0.818 | -0.685 0.702 -9.073 | -0.300 0.135 -4.955 |
## cluster.5 | 3.200 | -0.154 0.002 -0.465 | 2.528 0.624 9.534 |
## Dim.3 cos2 v.test
## cluster.1 -0.209 0.013 -2.384 |
## cluster.2 0.458 0.237 6.131 |
## cluster.3 -0.144 0.000 -0.534 |
## cluster.4 -0.325 0.158 -8.132 |
## cluster.5 1.942 0.368 11.105 |
Let’s visualize the percentage of variance captured by each dimension.
fviz_eig(wholesale_pca, ncp = 15, addlabels = T, main = "Variance explained by each dimensions")
More than 80% of the variance can be explained using only the first 3 dimensions, with the first dimension alone explaining 44.1% of the total variance.
We can keep around 85% of the information in our data by using only 3 dimensions. This means that we can reduce the number of numeric features in our dataset from 6 to just 3.
We can extract the values of PC1 to PC3 for all of the observations and put them into a new data frame. This data frame can later be analyzed using supervised classification techniques or for other purposes.
df_pca <- data.frame(wholesale_pca$ind$coord[, 1:3]) %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:3)
head(df_pca)
## cluster Dim.1 Dim.2 Dim.3
## 1 4 0.1932905 -0.3051000 0.14087845
## 2 4 0.4344199 -0.3284126 -0.31900662
## 3 4 0.8111432 0.8150957 -1.52341562
## 4 4 -0.7786478 0.6527537 -0.16301227
## 5 2 0.1662873 1.2714337 -0.06627939
## 6 4 -0.1561699 -0.2951410 -0.14761194
cos2, or the squared cosine value, shows the importance of a principal component for a given observation (a vector of the original variables). The value of cos2 can help find the components that are important for interpreting an observation.
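If we want the raw numbers behind this (assuming the usual FactoMineR result structure, where the squared cosines of the individuals are stored in the ind element):
head(wholesale_pca$ind$cos2[, 1:3])   # cos2 of the first observations on Dim.1 to Dim.3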
The individual observations map shows where each observation is positioned in terms of PC1 and PC2. Further analysis can be done to check the visualization.
fviz_pca_ind(wholesale_pca, habillage = 7, addEllipses = T)
From the plot, we can see that some observations from clusters 1, 4, and 5 fall inside the cluster 3 area.
If the observations are represented by their projections, the variables are represented by their correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components. The closer to the center of the plot a variable is, the less important it is for the first two components.
fviz_pca_var(wholesale_pca, select.var = list(contrib = 6), col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
The plot above shows that the variables are located inside the circle, meaning that we would need more than two components to represent our data perfectly. The distance between a variable and the origin measures the quality of that variable on the factor map. The color indicates the contribution of each variable. The contributions of variables in accounting for the variability of a given principal component are expressed in percentages. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. Variables that are not correlated with any PC, or that are correlated only with the last dimensions, have low contribution and might be removed to simplify the overall analysis.
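The underlying contribution values (in percent) can also be inspected directly from the PCA object (again assuming the usual FactoMineR result structure):
round(wholesale_pca$var$contrib[, 1:3], 2)   # contribution of each variable to Dim.1 to Dim.3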
We can also check the quality of representation, or cos2, of each variable. A high cos2 indicates a good representation of the variable on the principal components; in that case the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is not perfectly represented by the PCs; in that case the variable is close to the center of the circle.
fviz_cos2(wholesale_pca, choice = "var", fill = "cos2") + scale_fill_viridis_c(option = "B") +
theme(legend.position = "top")
Grocery, Detergents_Paper, and Milk are the variables best represented by (and contributing most to) PC1, with Grocery having the highest contribution, while Fresh, Frozen, and Delicassen are captured more by PC2.
a <- dimdesc(wholesale_pca)
as.data.frame(a[[1]]$quanti)
## correlation p.value
## Grocery 0.9420663 6.584197e-210
## Detergents_Paper 0.8922741 2.332450e-153
## Milk 0.8865463 1.021905e-148
## Delicassen 0.4044408 9.572010e-19
## cluster -0.4030353 1.292527e-18
PCA can also be integrated with the result of the K-means Clustering to help visualize our data in a fewer dimensions than the original features.
fviz_cluster(object = km, data = df, labelsize = 0) + theme_minimal()
However, the same problem as with the individual observations map occurs: we have too few dimensions to represent our data. In the visualization above, the clusters appear to intersect each other because two dimensions are not enough to separate them.
We can add one more dimension using plotly to see whether our clusters are still clumped together.
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black",
"red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))
We can draw some conclusions about our dataset based on the preceding cluster and principal component analyses:
We can separate our data into at least 5 clusters based on all of the numerical features, with more than 66% of the total sum of squares coming from the between-cluster variation.
We can reduce our data from 6 features to just 3 dimensions and still retain more than 80% of the variance using PCA. This dimensionality reduction can be useful if we use the principal components in machine learning applications.
However, as we have seen, the dimensionality reduction is not enough for us to cleanly visualize the clustering of our data, as indicated by the overlapping clusters when we only use the first 3 dimensions. For future work, we could try the number of clusters suggested by the gap statistic method, which was 7.