Picture taken from Kaggle.
Hello everyone, and welcome to my Rmd. This is my HTML document about clustering the Wholesale Customers Data Set. I hope you enjoy it!
We will learn to use K-Means clustering and PCA on the Wholesale dataset, and we want to know how many clusters can be formed.
Data Source: https://www.kaggle.com/datasets/binovi/wholesale-customers-data-set
The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
Source: UCI Wholesale customers Data Set
We will try to do a clustering analysis using the K-means method. We will also see if we can reduce the dimensionality using Principal Component Analysis (PCA).
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(tidyverse)
library(lubridate)
library(cluster)
library(ggforce)
library(scales)
library(cowplot)
The dataset contains 8 columns and 440 rows.
wholesale <- read.csv("data_input/wholesale.csv")
str(wholesale)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
Of the 8 columns in the dataset, we need to convert Channel and Region to factors.
wholesale <- wholesale %>%
mutate(Channel = as.factor(Channel),
Region = as.factor(Region))
glimpse(wholesale)
## Rows: 440
## Columns: 8
## $ Channel <fct> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,~
## $ Region <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,~
## $ Fresh <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5~
## $ Milk <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,~
## $ Grocery <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,~
## $ Frozen <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115~
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, ~
## $ Delicassen <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2~
Next, we check for missing values in the dataset.
anyNA(wholesale)
## [1] FALSE
colSums(is.na(wholesale))
## Channel Region Fresh Milk
## 0 0 0 0
## Grocery Frozen Detergents_Paper Delicassen
## 0 0 0 0
There are no missing values in the dataset, so we can continue to the next steps.
One of the easiest ways to get an initial look at possible clusters is to use Channel. Here we plot the Grocery and Fresh columns.
ggplot(wholesale, aes(Grocery, Fresh, color = Channel)) +
geom_point(alpha = 0.5) + theme_minimal()
There are only slight differences between Channel 1 and Channel 2 when looking at Grocery and Fresh.
Now we want to check some stats between Channel 1 and 2.
p1 <- ggplot(wholesale, aes(Channel, Fresh, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Fresh")
p2 <- ggplot(wholesale, aes(Channel, Milk, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Milk")
p3 <- ggplot(wholesale, aes(Channel, Grocery, fill = Channel)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Grocery")
p4 <- ggplot(wholesale, aes(Channel, Frozen, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Frozen")
p5 <- ggplot(wholesale, aes(Channel, Detergents_Paper, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Detergents_Paper")
p6 <- ggplot(wholesale, aes(Channel, Delicassen, fill = Channel)) + geom_boxplot() +
theme_minimal() + theme(legend.position = "bottom") + labs(title = "Delicassen")
plot_grid(p1, p2, p3, p4, p5, p6)
Channel 2 performs better in most aspects except Fresh and Frozen.
Based on our exploratory process, we could segment or cluster the customers based on specific traits such as their Channel or Region. To find more interesting and undiscovered patterns in the data, we will use the K-means clustering method.
We want to see if there is a high correlation between the numeric variables. Strong correlation among some variables implies that we can reduce the dimensionality, or number of features, using Principal Component Analysis (PCA).
df <- wholesale %>% select(-c(Channel, Region))
ggcorr(df, low = "navy", high = "darkred")
There are some features with high correlation, such as Grocery with Detergents_Paper and Milk with Grocery. Based on this result, we will try to reduce the dimensionality using PCA.
Before we do the cluster analysis, we first need to determine the optimal number of clusters. In clustering we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters based on majority voting.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to choose the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
#elbow method
fviz_nbclust(df, kmeans, method = "wss", k.max = 15) + scale_y_continuous(labels = number_format(scale = 10^(-9),
big.mark = ",", suffix = " bil.")) + labs(subtitle = "Elbow method")
Using the elbow method, 5 clusters look good enough, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. This method alone may not be enough, because the optimal number of clusters is still vague.
The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number with the highest average silhouette score (the peak).
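For each observation i, with a(i) the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster, the silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)); fviz_nbclust() averages this score over all observations for each candidate k.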
#Silhouette
fviz_nbclust(df, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method")
Based on the silhouette method, the number of clusters with the maximum average score is considered the optimal k. The graph shows that the optimal number of clusters is 2.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
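In short, Gap(k) = E*[log(W_k)] - log(W_k), where W_k is the total within-cluster sum of squares for k clusters and E* denotes the expectation under the reference (uniform) distribution, estimated by Monte Carlo sampling.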
#GAP statistic
fviz_nbclust(df, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")
Every method suggests a different number of clusters. We could also try the candidate numbers one by one and compare the resulting clusters, but in this case we will use k = 5.
Here is the algorithm behind K-means clustering:
Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments for the observations.
Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid, which is the vector of the p feature means for the observations in that cluster, then assign each observation to the cluster whose centroid is closest (using Euclidean distance or any other distance measure).
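To make these two steps concrete, here is a minimal sketch of a manual K-means pass on df (for illustration only; it assumes no cluster becomes empty during the iterations, and the actual clustering below uses kmeans()):
# illustrative sketch: manual K-means steps on the numeric features in df
set.seed(123)
k <- 5
assignment <- sample(1:k, nrow(df), replace = TRUE)   # step 1: random initial assignment
for (iter in 1:10) {                                  # step 2: iterate (fixed number of passes here)
  # compute each cluster centroid: the vector of feature means per cluster
  centroids <- sapply(split(df, assignment), colMeans)
  # squared Euclidean distance of every observation to every centroid
  dist_to_centroids <- sapply(1:k, function(j) colSums((t(as.matrix(df)) - centroids[, j])^2))
  # reassign each observation to its closest centroid
  assignment <- max.col(-dist_to_centroids)
}
table(assignment)   # resulting cluster sizes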
set.seed(123)
km <- kmeans(df, centers = 5)
km
## K-means clustering with 5 clusters of sizes 79, 102, 10, 226, 23
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 4667.582 11639.101 18289.785 1525.241 8060.3165 1594.4557
## 2 21337.549 3933.853 5176.069 4182.784 1138.1078 1713.0294
## 3 21263.700 37443.300 46710.600 6287.200 21699.4000 8743.3000
## 4 6143.872 3276.792 4115.164 2438.350 1220.7035 992.5398
## 5 49296.087 4983.783 5590.304 8285.783 962.2609 2543.6957
##
## Clustering vector:
## [1] 4 4 4 4 2 4 4 4 4 1 1 4 2 2 2 4 1 4 2 4 2 4 2 3 2 2 4 4 1 5 2 4 2 2 4 4 2
## [38] 1 1 5 2 2 1 1 4 1 1 3 4 1 4 4 5 1 2 4 1 1 4 4 4 3 4 1 4 3 4 2 4 4 2 2 4 2
## [75] 4 2 4 1 4 4 4 1 4 2 4 3 3 5 4 2 4 4 3 2 1 4 4 4 4 4 1 1 4 5 4 2 1 1 4 1 4
## [112] 1 2 2 2 4 4 4 2 4 2 4 4 4 5 5 2 2 4 5 4 4 2 4 4 4 4 4 4 4 2 2 5 4 2 1 4 4
## [149] 4 2 2 4 2 4 4 1 1 2 4 1 4 4 2 1 4 1 4 4 4 4 1 1 4 1 4 1 5 4 4 4 4 5 1 3 4
## [186] 4 4 4 1 1 2 4 4 1 4 2 2 4 4 4 1 1 2 4 4 1 4 4 4 1 2 3 4 4 1 1 1 2 1 4 2 4
## [223] 4 4 4 4 2 4 4 4 4 4 2 4 2 4 4 2 4 5 2 2 2 4 4 1 4 4 2 4 4 1 4 2 4 2 4 4 5
## [260] 5 4 4 2 4 1 1 1 2 1 2 4 4 4 5 4 4 2 4 4 2 4 4 5 2 5 5 4 2 2 5 4 4 4 1 2 4
## [297] 2 4 4 4 2 1 4 1 1 1 1 2 4 1 4 2 1 4 4 1 4 4 4 1 4 4 2 4 2 5 4 4 2 4 4 1 2
## [334] 3 2 2 4 4 4 4 4 4 4 1 4 4 1 2 4 1 4 1 4 1 2 4 2 1 4 4 2 4 4 4 4 4 4 4 2 4
## [371] 5 2 4 2 4 4 1 5 4 4 2 2 2 4 1 4 4 2 4 4 4 4 4 2 4 4 1 4 4 4 4 2 2 2 2 4 2
## [408] 1 4 4 4 4 4 4 4 4 1 4 1 4 1 2 2 2 2 4 1 2 4 4 1 4 2 4 2 2 5 1 4 4
##
## Within cluster sum of squares by cluster:
## [1] 8724419459 8375563097 14108802241 10318245206 11679101316
## (between_SS / total_SS = 66.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
df_clust <- wholesale %>%
bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:8)
We will analyze the characteristics of each cluster and see whether there are differences or specific traits in each one. Since we have quite a few features (8), we might not be able to explore all of them.
df_clust %>% ggplot(aes(Grocery, Detergents_Paper, color = cluster)) +
  geom_point(alpha = 0.5) + geom_mark_hull() + scale_color_brewer(palette = "Set1") +
  theme_minimal() + theme(legend.position = "top")
## Warning: The concaveman package is required for geom_mark_hull
From the plot we can see how the data is grouped into 5 clusters. Although the clusters are close to each other, we can still see the differences between them.
We can also check each cluster's centroid:
df_clust %>% group_by(cluster) %>% summarise_if(is.numeric, "mean") %>% select(cluster,
Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) %>% mutate_if(is.numeric,
.funs = "round", digits = 2)
## # A tibble: 5 x 7
## cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4668. 11639. 18290. 1525. 8060. 1594.
## 2 2 21338. 3934. 5176. 4183. 1138. 1713.
## 3 3 21264. 37443. 46711. 6287. 21699. 8743.
## 4 4 6144. 3277. 4115. 2438. 1221. 993.
## 5 5 49296. 4984. 5590. 8286. 962. 2544.
From the table above, we can see the average of each variable for clusters 1 to 5 (a quick check of these readings is sketched after the list):
The highest average Fresh spending is in cluster 5 and the lowest is in cluster 1.
The highest average Milk spending is in cluster 3 and the lowest is in cluster 4.
The highest average Grocery spending is in cluster 3 and the lowest is in cluster 4.
The highest average Frozen spending is in cluster 5 and the lowest is in cluster 1.
The highest average Detergents_Paper spending is in cluster 3 and the lowest is in cluster 5.
The highest average Delicassen spending is in cluster 3 and the lowest is in cluster 4.
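As a quick cross-check of the statements above (an illustrative sketch on the same df_clust data, not part of the original workflow), we can ask, for each variable, which cluster has the highest and lowest mean:
df_clust %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), mean)) %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "mean_spend") %>%
  group_by(variable) %>%
  summarise(highest = cluster[which.max(mean_spend)],
            lowest  = cluster[which.min(mean_spend)])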
**How is each kind of Channel distributed across the clusters?**
df_clust %>% group_by(cluster, Channel) %>% summarise(total = n())
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 10 x 3
## # Groups: cluster [5]
## cluster Channel total
## <fct> <fct> <int>
## 1 1 1 7
## 2 1 2 72
## 3 2 1 81
## 4 2 2 21
## 5 3 1 1
## 6 3 2 9
## 7 4 1 188
## 8 4 2 38
## 9 5 1 21
## 10 5 2 2
Cluster 4 has the greatest number of Channel 1 customers and cluster 3 has the fewest. Cluster 1 has the greatest number of Channel 2 customers and cluster 5 has the fewest.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
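As a quick illustration of this idea (a base-R sketch, not part of the original workflow; the analysis below uses FactoMineR::PCA()), prcomp() performs the same orthogonal transformation on the scaled numeric features:
# base-R cross-check: PCA on the scaled numeric features
pca_base <- prcomp(df, scale. = TRUE)
summary(pca_base)     # proportion of variance captured by each principal component
pca_base$sdev^2       # eigenvalues: variance retained by each component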
Here we will run PCA on the wholesale1 dataset. We will look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
wholesale1 <- wholesale
wholesale1$cluster <- km$cluster
wholesale_pca <- PCA(wholesale1 %>% select(-c(Channel,Region)), scale.unit = T, ncp = 31, graph = F,
quali.sup = 7)
summary(wholesale_pca)
##
## Call:
## PCA(X = wholesale1 %>% select(-c(Channel, Region)), scale.unit = T,
## ncp = 31, quali.sup = 7, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.645 1.703 0.740 0.564 0.286 0.063
## % of var. 44.083 28.376 12.334 9.396 4.761 1.050
## Cumulative % of var. 44.083 72.459 84.794 94.189 98.950 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.795 | 0.193 0.003 0.059 | -0.305 0.012 0.147 |
## 2 | 0.753 | 0.434 0.016 0.333 | -0.328 0.014 0.190 |
## 3 | 2.332 | 0.811 0.057 0.121 | 0.815 0.089 0.122 |
## 4 | 1.133 | -0.779 0.052 0.472 | 0.653 0.057 0.332 |
## 5 | 1.577 | 0.166 0.002 0.011 | 1.271 0.216 0.650 |
## 6 | 0.736 | -0.156 0.002 0.045 | -0.295 0.012 0.161 |
## 7 | 0.738 | -0.335 0.010 0.206 | -0.525 0.037 0.506 |
## 8 | 0.623 | 0.141 0.002 0.051 | -0.231 0.007 0.137 |
## 9 | 0.884 | -0.517 0.023 0.343 | -0.659 0.058 0.557 |
## 10 | 1.782 | 1.592 0.218 0.799 | -0.741 0.073 0.173 |
## Dim.3 ctr cos2
## 1 0.141 0.006 0.031 |
## 2 -0.319 0.031 0.179 |
## 3 -1.523 0.713 0.427 |
## 4 -0.163 0.008 0.021 |
## 5 -0.066 0.001 0.002 |
## 6 -0.148 0.007 0.040 |
## 7 0.303 0.028 0.169 |
## 8 -0.390 0.047 0.392 |
## 9 -0.183 0.010 0.043 |
## 10 -0.210 0.014 0.014 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Fresh | 0.070 0.184 0.005 | 0.689 27.871 0.475 | 0.699 65.976
## Milk | 0.887 29.715 0.786 | 0.109 0.692 0.012 | -0.052 0.365
## Grocery | 0.942 33.554 0.887 | -0.191 2.134 0.036 | 0.093 1.175
## Frozen | 0.083 0.262 0.007 | 0.798 37.366 0.636 | -0.153 3.182
## Detergents_Paper | 0.892 30.101 0.796 | -0.333 6.514 0.111 | 0.117 1.855
## Delicassen | 0.404 6.184 0.164 | 0.658 25.422 0.433 | -0.451 27.448
## cos2
## Fresh 0.488 |
## Milk 0.003 |
## Grocery 0.009 |
## Frozen 0.024 |
## Detergents_Paper 0.014 |
## Delicassen 0.203 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## cluster.1 | 1.854 | 1.624 0.768 9.790 | -0.860 0.215 -6.458 |
## cluster.2 | 0.942 | -0.448 0.226 -3.170 | 0.679 0.520 5.990 |
## cluster.3 | 7.633 | 7.576 0.985 14.884 | 0.836 0.012 2.048 |
## cluster.4 | 0.818 | -0.685 0.702 -9.073 | -0.300 0.135 -4.955 |
## cluster.5 | 3.200 | -0.154 0.002 -0.465 | 2.528 0.624 9.534 |
## Dim.3 cos2 v.test
## cluster.1 -0.209 0.013 -2.384 |
## cluster.2 0.458 0.237 6.131 |
## cluster.3 -0.144 0.000 -0.534 |
## cluster.4 -0.325 0.158 -8.132 |
## cluster.5 1.942 0.368 11.105 |
Let’s visualize the percentage of variance captured by each dimension.
fviz_eig(wholesale_pca, ncp = 15, addlabels = T, main = "Variance explained by each dimensions")
More than 80% of the variance can be explained using only the first 3 dimensions, with the first dimension alone explaining 44.1% of the total variance.
We can keep around 85% of the information in our data by using only 3 dimensions. This means that we can reduce the number of numeric features in our dataset from 6 to just 3.
We can extract the values of PC1 to PC3 for all of the observations and put them into a new data frame. This data frame can later be analyzed using supervised classification techniques or for other purposes.
df_pca <- data.frame(wholesale_pca$ind$coord[, 1:3]) %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:3)
head(df_pca)
## cluster Dim.1 Dim.2 Dim.3
## 1 4 0.1932905 -0.3051000 0.14087845
## 2 4 0.4344199 -0.3284126 -0.31900662
## 3 4 0.8111432 0.8150957 -1.52341562
## 4 4 -0.7786478 0.6527537 -0.16301227
## 5 2 0.1662873 1.2714337 -0.06627939
## 6 4 -0.1561699 -0.2951410 -0.14761194
cos2, or the squared cosine value, shows the importance of a principal component for a given observation (a vector of the original variables). The value of cos2 can help find the components that are important for interpreting an observation.
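If we want the raw numbers behind this (assuming the usual FactoMineR result structure, where the squared cosines of the individuals are stored in the ind element):
head(wholesale_pca$ind$cos2[, 1:3])   # cos2 of the first observations on Dim.1 to Dim.3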
The individual observations map shows where each observation is positioned in terms of PC1 and PC2. Further analysis can be done to check the visualization.
fviz_pca_ind(wholesale_pca, habillage = 7, addEllipses = T)
From the plot, we can see that some observations from clusters 1, 4, and 5 fall inside the cluster 3 area.
If the observations are represented by their projections, the variables are represented by their correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components. The closer to the center of the plot a variable is, the less important it is for the first two components.
fviz_pca_var(wholesale_pca, select.var = list(contrib = 6), col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
The plot above shows that the variables are located inside the circle, meaning that we would need more than two components to represent our data perfectly. The distance between a variable and the origin measures the quality of that variable on the factor map. The color indicates the contribution of each variable. The contributions of variables in accounting for the variability of a given principal component are expressed in percentages. Variables that are correlated with PC1 and PC2 are the most important in explaining the variability in the data set. Variables that are not correlated with any PC, or that are correlated only with the last dimensions, have low contribution and might be removed to simplify the overall analysis.
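The underlying contribution values (in percent) can also be inspected directly from the PCA object (again assuming the usual FactoMineR result structure):
round(wholesale_pca$var$contrib[, 1:3], 2)   # contribution of each variable to Dim.1 to Dim.3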
We can also check the quality of representation, or cos2, of each variable. A high cos2 indicates a good representation of the variable on the principal components; in that case the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is not perfectly represented by the PCs; in that case the variable is close to the center of the circle.
fviz_cos2(wholesale_pca, choice = "var", fill = "cos2") + scale_fill_viridis_c(option = "B") +
theme(legend.position = "top")
Grocery, Detergents_Paper, and Milk are the variables best represented by (and contributing most to) PC1, with Grocery having the highest contribution, while Fresh, Frozen, and Delicassen are captured more by PC2.
a <- dimdesc(wholesale_pca)
as.data.frame(a[[1]]$quanti)
## correlation p.value
## Grocery 0.9420663 6.584197e-210
## Detergents_Paper 0.8922741 2.332450e-153
## Milk 0.8865463 1.021905e-148
## Delicassen 0.4044408 9.572010e-19
## cluster -0.4030353 1.292527e-18
PCA can also be integrated with the result of the K-means Clustering to help visualize our data in a fewer dimensions than the original features.
fviz_cluster(object = km, data = df, labelsize = 0) + theme_minimal()
However, the same problem as with the individual observations map occurs: we have too few dimensions to represent our data. In the visualization above, the clusters appear to intersect each other because two dimensions are not enough to separate them.
We can add one more dimension using plotly to see whether our clusters are still clumped together.
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black",
"red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))
We can draw some conclusions about our dataset based on the preceding cluster and principal component analyses:
We can separate our data into at least 5 clusters based on all of the numerical features, with more than 66% of the total sum of squares coming from the between-cluster variation.
We can reduce our data from 6 features to just 3 dimensions and still retain more than 80% of the variance using PCA. This dimensionality reduction can be useful if we use the principal components in machine learning applications.
However, as we have seen, the dimensionality reduction is not enough for us to cleanly visualize the clustering of our data, as indicated by the overlapping clusters when we only use the first 3 dimensions. For future work, we could try the number of clusters suggested by the gap statistic method, which was 7.