knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
We will perform a clustering analysis using the K-means method. We will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(plotly)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(carData)
library(car)
options(scipen = 100, max.print = 101)
customers <- read.csv("datainput/Mall_Customers.csv")
head(customers)
#> CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 1 Male 19 15 39
#> 2 2 Male 21 15 81
#> 3 3 Female 20 16 6
#> 4 4 Female 23 16 77
#> 5 5 Female 31 17 40
#> 6 6 Female 22 17 76
Column description:
CustomerID: unique customer identifier
Gender: customer gender
Age: customer age in years
Annual.Income..k..: annual income in thousands
Spending.Score..1.100.: spending score on a 1-100 scale
glimpse(customers)
#> Rows: 200
#> Columns: 5
#> $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
#> $ Gender <chr> "Male", "Male", "Female", "Female", "Female", "…
#> $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,…
#> $ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,…
#> $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15…
From the output above, the data has 200 rows and 5 columns, and we can see the data type of each column. Checking the data types is a crucial step because they must be appropriate for the analysis.
rownames(customers) <- customers$CustomerID # use CustomerID values as row names
head(customers)
#> CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 1 Male 19 15 39
#> 2 2 Male 21 15 81
#> 3 3 Female 20 16 6
#> 4 4 Female 23 16 77
#> 5 5 Female 31 17 40
#> 6 6 Female 22 17 76
customers_clean <- customers %>%
select(-CustomerID) %>% # unselect CustomerID
mutate(Gender = as.factor(Gender)) # change data type from chr to fct
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 Male 19 15 39
#> 2 Male 21 15 81
#> 3 Female 20 16 6
#> 4 Female 23 16 77
#> 5 Female 31 17 40
#> 6 Female 22 17 76
colSums(is.na(customers_clean))
#> Gender Age Annual.Income..k..
#> 0 0 0
#> Spending.Score..1.100.
#> 0
The dataset has no missing values in any column.
summary(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> Female:112 Min. :18.00 Min. : 15.00 Min. : 1.00
#> Male : 88 1st Qu.:28.75 1st Qu.: 41.50 1st Qu.:34.75
#> Median :36.00 Median : 61.50 Median :50.00
#> Mean :38.85 Mean : 60.56 Mean :50.20
#> 3rd Qu.:49.00 3rd Qu.: 78.00 3rd Qu.:73.00
#> Max. :70.00 Max. :137.00 Max. :99.00
p1 <- ggplot(customers_clean, aes(Gender, Age, fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Age")
p2 <- ggplot(customers_clean, aes(Gender, Annual.Income..k.. , fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Annual.Income ")
p3 <- ggplot(customers_clean, aes(Gender, Spending.Score..1.100. , fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Spending.Score")
plot_grid(p1,p2,p3)
Insight:
Based on the exploratory process, we could segment the customers by gender alone. To find more interesting, previously undiscovered patterns in the data, we will instead apply K-means clustering.
ggpairs(customers_clean[,c(2:4)], showStrips = F) +
theme(axis.text = element_text(colour = "black", size = 11),
strip.background = element_rect(fill = "#d63d2d"),
strip.text = element_text(colour = "white", size = 12,
face = "bold"))
Insight:
None of the numeric features are strongly correlated with each other, which indicates that the dataset has little multicollinearity and already satisfies the non-multicollinearity assumption of many classification algorithms.
Principal Component Analysis can still be performed on this data to produce uncorrelated components, while also reducing the dimension of the data and retaining as much information as possible. The result of this analysis can then be used for classification with lower computational cost.
customers_num <- customers_clean %>%
select_if(is.numeric)
glimpse(customers_num)
#> Rows: 200
#> Columns: 3
#> $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,…
#> $ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,…
#> $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15…
# scaling
customers_scaled <- scale(customers_num)
# check data range
summary(customers_scaled)
#> Age Annual.Income..k.. Spending.Score..1.100.
#> Min. :-1.4926 Min. :-1.73465 Min. :-1.905240
#> 1st Qu.:-0.7230 1st Qu.:-0.72569 1st Qu.:-0.598292
#> Median :-0.2040 Median : 0.03579 Median :-0.007745
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.000000
#> 3rd Qu.: 0.7266 3rd Qu.: 0.66401 3rd Qu.: 0.882916
#> Max. : 2.2299 Max. : 2.91037 Max. : 1.889750
Check the plot of how much variance each PC explains when using scaled data:
# look at the variance captured by each PC
plot(prcomp(customers_scaled))
Note:
Scaling is needed when variables are measured on different scales. Here scale() standardizes each variable to mean 0 and standard deviation 1; it does not make the data normally distributed, it only puts the variables on a comparable scale.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the "bend of the elbow", where the total within-cluster sum of squares starts to flatten out as the number of clusters increases.
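As a sketch of what the elbow method computes, the total within-cluster sum of squares can also be calculated by hand for a range of k; the loop below is illustrative only (the seed, nstart = 25, and the range 1 to 10 are arbitrary choices, not part of the original analysis).
# manual elbow curve: total within-cluster sum of squares for k = 1..10 (illustrative sketch)
set.seed(100)
wss <- sapply(1:10, function(k) {
  kmeans(customers_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")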
library(factoextra)
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "wss"
)
Using the elbow method, 8 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. This method alone may not be enough, because the location of the elbow is vague.
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. The optimal number of clusters is the one with the highest average silhouette score (the peak).
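As an illustration of what this score summarizes, the average silhouette width for a single candidate k can be computed directly with cluster::silhouette(); this is a sketch only, and km_tmp, the seed, and k = 8 are arbitrary illustrative choices.
# average silhouette width for one candidate k (illustrative sketch)
set.seed(100)
km_tmp <- kmeans(customers_scaled, centers = 8, nstart = 25)
sil <- silhouette(km_tmp$cluster, dist(customers_scaled))
mean(sil[, "sil_width"])  # average silhouette width for k = 8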
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "silhouette"
)
Based on the silhouette method, the number of clusters with the maximum score is considered the optimum k. The graph shows that the optimum number of clusters is 8.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
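The same statistic can also be computed directly with cluster::clusGap() before plotting it via fviz_nbclust(); the call below is a sketch, with B = 50 bootstrap reference sets and K.max = 10 chosen arbitrarily.
# gap statistic computed directly (illustrative sketch; B and K.max are arbitrary)
set.seed(100)
gap <- clusGap(customers_scaled, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
# fviz_gap_stat(gap)  # equivalent plot to fviz_nbclust(..., method = "gap_stat")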
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "gap_stat"
)
Based on the gap statistic method, the optimal k is 6.
Two out of three methods suggest that k = 8 is the optimum number of clusters, so we proceed with k = 8. The disagreement between the methods can also be a warning that clustering may not be very useful for this data.
Here is the idea behind K-means clustering: starting from k initial centroids, the algorithm repeatedly assigns each observation to its nearest centroid and then moves each centroid to the mean of the observations assigned to it, until the assignments stop changing.
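The function below is a minimal from-scratch sketch of that loop, for reference only; simple_kmeans is a hypothetical helper (the analysis itself uses stats::kmeans(), as in the next chunk), and it does not handle empty clusters or multiple random restarts.
# minimal K-means sketch (Lloyd's algorithm) -- illustrative only, not used in the analysis
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # assignment step: each observation goes to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_assignment <- max.col(-d)
    if (all(new_assignment == assignment)) break  # assignments stable -> converged
    assignment <- new_assignment
    # update step: move each centroid to the mean of its assigned observations
    centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centers)
}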
RNGkind(sample.kind = "Rounding")
set.seed(100)
customers_cluster <- kmeans(customers_scaled, centers = 8) # number of centroids (clusters) we want
customers_cluster
#> K-means clustering with 8 clusters of sizes 31, 20, 20, 21, 10, 27, 39, 32
#>
#> Cluster means:
#> Age Annual.Income..k.. Spending.Score..1.100.
#> 1 -1.0284288 -0.3148588 0.03597620
#> 2 0.4688952 -1.3291594 -1.22562679
#> 3 1.8612633 -0.2821275 -0.01355353
#> 4 -0.9676183 -1.3502813 1.15583070
#> 5 -0.8912587 0.5764386 -1.32824641
#> 6 0.4747282 1.1885821 -1.20676658
#> 7 -0.4408110 0.9891010 1.23640011
#> 8 0.5901457 -0.1902742 -0.09245447
#>
#> Clustering vector:
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
#> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#> 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
#> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 3 4 8 1 2 1 8 1 1 1 8 1 1 3 8 8 8 3 1 8
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 3 1 3 8 3 1 8 3 1 1 3 8 3 3 3 1 8 8 1 8
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#> 3 8 3 8 1 8 8 1 1 8 3 1 8 8 1 1 8 1 8 1
#> 101
#> 1
#> [ reached getOption("max.print") -- omitted 99 entries ]
#>
#> Within cluster sum of squares by cluster:
#> [1] 12.116105 19.809395 4.461166 7.591476 4.952009 23.114703 22.362673
#> [8] 9.645699
#> (between_SS / total_SS = 82.6 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
K-means results:
customers_cluster$size
#> [1] 31 20 20 21 10 27 39 32
Note: cluster 1 has 31 observations, cluster 2 has 20, cluster 3 has 20, cluster 4 has 21, cluster 5 has 10, cluster 6 has 27, cluster 7 has 39, and cluster 8 has 32.
# location of each cluster center (centroid)
customers_cluster$centers
#> Age Annual.Income..k.. Spending.Score..1.100.
#> 1 -1.0284288 -0.3148588 0.03597620
#> 2 0.4688952 -1.3291594 -1.22562679
#> 3 1.8612633 -0.2821275 -0.01355353
#> 4 -0.9676183 -1.3502813 1.15583070
#> 5 -0.8912587 0.5764386 -1.32824641
#> 6 0.4747282 1.1885821 -1.20676658
#> 7 -0.4408110 0.9891010 1.23640011
#> 8 0.5901457 -0.1902742 -0.09245447
The output above shows the centroid of each cluster: the average (scaled) value of each variable within that cluster.
label <- as.data.frame(customers_cluster$cluster)
head(label)
#> customers_cluster$cluster
#> 1 1
#> 2 4
#> 3 2
#> 4 4
#> 5 2
#> 6 4
The output above shows which cluster each customer (row) is assigned to.
customers_cluster$iter
#> [1] 3
The output above shows that the algorithm needed 3 iterations to reach stable clusters.
The quality of the clustering can be assessed from three values:
$withinss: the sum of squared distances from each observation to the centroid of its cluster.
$betweenss: the weighted sum of squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
$totss: the sum of squared distances from each observation to the global mean.
customers_cluster$withinss
#> [1] 12.116105 19.809395 4.461166 7.591476 4.952009 23.114703 22.362673
#> [8] 9.645699
customers_cluster$tot.withinss
#> [1] 104.0532
bss <- customers_cluster$betweenss
bss
#> [1] 492.9468
tss <- customers_cluster$totss
bss/tss
#> [1] 0.8257065
Insight: The clustering is quite good because the BSS/TSS ratio of about 0.83 is close to 1.
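As a sanity check on these numbers, the three sums of squares can be recomputed from the scaled data and the fitted cluster labels; this is a sketch, and tss_manual / wss_manual are illustrative names.
# recompute the sums of squares by hand and compare with the kmeans output
global_mean <- colMeans(customers_scaled)
tss_manual <- sum(sweep(customers_scaled, 2, global_mean)^2)  # total sum of squares
wss_manual <- sum(sapply(1:8, function(i) {
  members <- customers_scaled[customers_cluster$cluster == i, , drop = FALSE]
  sum(sweep(members, 2, colMeans(members))^2)  # within-cluster sum of squares
}))
c(tot.withinss = wss_manual, betweenss = tss_manual - wss_manual, totss = tss_manual)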
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 Male 19 15 39
#> 2 Male 21 15 81
#> 3 Female 20 16 6
#> 4 Female 23 16 77
#> 5 Female 31 17 40
#> 6 Female 22 17 76
# 2-dimensional cluster visualization
fviz_cluster(object = customers_cluster,
data = customers_num)
Create a new column containing the cluster label assigned to each customer by the clustering with the chosen k.
customers_clean["cluster"] <- as.factor(customers_cluster$cluster)
# check the head of the data
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100. cluster
#> 1 Male 19 15 39 1
#> 2 Male 21 15 81 4
#> 3 Female 20 16 6 2
#> 4 Female 23 16 77 4
#> 5 Female 31 17 40 2
#> 6 Female 22 17 76 4
# cluster profiling
library(dplyr)
customers_clean %>%
group_by(cluster) %>%
summarise_all(.funs = "mean")
#> # A tibble: 8 × 5
#> cluster Gender Age Annual.Income..k.. Spending.Score..1.100.
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 24.5 52.3 51.1
#> 2 2 NA 45.4 25.6 18.6
#> 3 3 NA 64.8 53.2 49.8
#> 4 4 NA 25.3 25.1 80.0
#> 5 5 NA 26.4 75.7 15.9
#> 6 6 NA 45.5 91.8 19.0
#> 7 7 NA 32.7 86.5 82.1
#> 8 8 NA 47.1 55.6 47.8
Note: Gender is a factor, so summarise_all() with mean returns NA for it; it could be excluded beforehand with select(-Gender).
From the profiling result, we can derive insights:
Cluster 1: Consists of customers characterized by: predominantly Female gender, average age in the mid-20s, medium annual income, and medium spending score.
Cluster 2: Consists of customers characterized by: predominantly Female gender, average age of 40-50 years, low annual income, and low spending score.
Cluster 3: Consists of customers characterized by: predominantly Male gender, average age over 60 years, medium annual income, and medium spending score.
Cluster 4: Consists of customers characterized by: predominantly Female gender, average age under 30 years, low annual income, and high spending score.
Cluster 5: Consists of customers characterized by: balanced gender distribution, average age in the mid-20s, high annual income, and low spending score.
Cluster 6: Consists of customers characterized by: balanced gender distribution, average age over 40 years, high annual income, and low spending score.
Cluster 7: Consists of customers characterized by: balanced gender distribution, average age in the early 30s, high annual income, and high spending score.
Cluster 8: Consists of customers characterized by: balanced gender distribution, average age over 40 years, medium annual income, and medium spending score.
The first step of this analysis performs PCA with the prcomp() function. With scale = TRUE, the data is centered at 0 and each variable is scaled to unit variance, i.e. standardized, which is necessary because the variables are measured on different scales. Additionally, eigenvalues are extracted with the get_eigenvalue() function. Eigenvalues measure the amount of variation captured by each principal component (PC) and are evaluated to determine how many principal components should be kept.
cust.pca <- prcomp(customers_num, scale = TRUE)
print(cust.pca)
#> Standard deviations (1, .., p=3):
#> [1] 1.1523823 0.9996256 0.8202217
#>
#> Rotation (n x k) = (3 x 3):
#> PC1 PC2 PC3
#> Age 0.70638235 0.03014116 -0.707188441
#> Annual.Income..k.. -0.04802398 0.99883160 -0.005397916
#> Spending.Score..1.100. -0.70619946 -0.03777499 -0.707004506
summary(cust.pca)
#> Importance of components:
#> PC1 PC2 PC3
#> Standard deviation 1.1524 0.9996 0.8202
#> Proportion of Variance 0.4427 0.3331 0.2243
#> Cumulative Proportion 0.4427 0.7758 1.0000
eig.val<-get_eigenvalue(cust.pca)
eig.val
#> eigenvalue variance.percent cumulative.variance.percent
#> Dim.1 1.3279850 44.26617 44.26617
#> Dim.2 0.9992514 33.30838 77.57455
#> Dim.3 0.6727636 22.42545 100.00000
fviz_eig(cust.pca, col.var="blue")
Based on the importance of the components, the first two PCs have the highest proportions of variance. This is also supported by the eigenvalues: the eigenvalue is largest for the first PC and decreases for the subsequent PCs, indicating that the first PC corresponds to the direction of maximum variation in the dataset. The eigenvalues sum to 3, the number of standardized variables. The first component explains 44.27% of the variance and the second explains 33.31%, so together the first two components explain 77.57% of the variance, which is a reasonable basis for further analysis.
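The eigenvalues reported by get_eigenvalue() are simply the squared standard deviations of the components returned by prcomp(), and the variance percentages follow from dividing each eigenvalue by their sum; a quick check (a sketch, assuming the cust.pca object above):
# eigenvalues are the squared component standard deviations
cust.pca$sdev^2
# proportion and cumulative proportion of variance explained
cust.pca$sdev^2 / sum(cust.pca$sdev^2)
cumsum(cust.pca$sdev^2 / sum(cust.pca$sdev^2))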
The results of PCA can be evaluated for the variables (customers_num) and for the individuals (customers). First, the results for the variables are extracted. For this purpose, get_pca_var() is used; it returns a list of matrices containing all results for the active variables (coordinates, correlations between variables and axes, squared cosines, and contributions).
var <- get_pca_var(cust.pca)
var
#> Principal Component Analysis Results for variables
#> ===================================================
#> Name Description
#> 1 "$coord" "Coordinates for the variables"
#> 2 "$cor" "Correlations between variables and dimensions"
#> 3 "$cos2" "Cos2 for the variables"
#> 4 "$contrib" "contributions of the variables"
Cos2 stands for squared cosine (squared coordinate) and reflects the quality of representation of a variable. The cos2 of the variables on all dimensions is displayed below with the corrplot package, along with a bar plot of the variable cos2 using the fviz_cos2() function.
head(var$cos2)
#> Dim.1 Dim.2 Dim.3
#> Age 0.662632678 0.0009078095 0.33645951286
#> Annual.Income..k.. 0.003062735 0.9969176625 0.00001960264
#> Spending.Score..1.100. 0.662289604 0.0014258818 0.33628451440
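Because the variable coordinates of a standardized PCA are correlations with the components, cos2 is just the squared coordinate, and the cos2 values of each variable sum to 1 across all retained components; a quick check (a sketch, assuming the var object above):
# cos2 equals the squared variable coordinate ...
all.equal(var$cos2, var$coord^2, check.attributes = FALSE)
# ... and each variable's cos2 sums to 1 over the three components
rowSums(var$cos2)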
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
fviz_cos2(cust.pca, choice = "var", axes = 1:3)
Note:
With only 3 numeric variables in this dataset, all of them are considered to make significant contributions.
Additionally, the quality of variable representation can be displayed on a factor map, where cos2 values vary by gradient color. Variables with low cos2 values will be colored “darkorchid4”, moderate cos2 values - “gold”, high cos2 values - “red”. Positively correlated variables are grouped together, while negatively correlated variables are placed on the opposite side of the plot origin. The distance between variables and the origin measures the quality of variables on the factor map. Variables that are far from the origin are well-represented on the factor map.
fviz_pca_var(cust.pca,
col.var = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "red"),
repel = TRUE
)
Insight :
Annual.Income has a very high cos2 value, indicating good representation of this variable on the principal components; it is positioned close to the circumference of the correlation circle. Spending.Score and Age have the lowest cos2 values, indicating that these variables are not fully represented by the first two components.
"Contrib" refers to the contribution of variables. The fviz_contrib() function is used to draw bar plots of the variable contributions to the most significant dimensions, namely PC1, PC2, and PC3.
# Contributions of variables to PC1
a<-fviz_contrib(cust.pca, choice = "var", axes = 1)
# Contributions of variables to PC2
b<-fviz_contrib(cust.pca, choice = "var", axes = 2)
# Contributions of variables to PC3
c<-fviz_contrib(cust.pca, choice = "var", axes = 3)
grid.arrange(a, b, c, ncol = 2, top = 'Contribution of the variables to the first three PCs')
Insight :
The red line in each plot above indicates the expected average contribution. For a given component, variables with contributions exceeding this threshold are considered important contributors to that component. It can be observed that all three variables contribute to at least one component.
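For reference, the contribution of a variable to a component is its cos2 expressed as a percentage of that component's total cos2, so with three variables the expected average contribution (the red line) is 100/3 ≈ 33.3%; the check below is a sketch, and contrib_manual is an illustrative name.
# contribution (%) = 100 * cos2 / column total of cos2 for that component
contrib_manual <- sweep(var$cos2, 2, colSums(var$cos2), "/") * 100
all.equal(contrib_manual, var$contrib, check.attributes = FALSE)
100 / nrow(var$cos2)  # expected average contribution with 3 variables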
The results, for individuals (customers), will be extracted using the get_pca_ind() function. Similar to variables, this provides a matrix list containing all results for individuals (coordinates, correlations between individuals and axes, squared cosines, and contributions). For individuals, the analysis will focus on the cos2 and contributions of individuals to all principal components (PC1, PC2, and PC3).
ind <- get_pca_ind(cust.pca)
ind
#> Principal Component Analysis Results for individuals
#> ===================================================
#> Name Description
#> 1 "$coord" "Coordinates for the individuals"
#> 2 "$cos2" "Cos2 for the individuals"
#> 3 "$contrib" "contributions of the individuals"
fviz_pca_ind(cust.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "red"),
repel = TRUE
)
Insight :
Customers colored red have very high cos2 values, meaning they are well represented on the principal components and lie close to the correlation circle. Customers colored darkorchid4 have the lowest cos2 values, meaning they are not well represented by the principal components and lie close to the center of the circle.
# Total contribution on PC1 and PC2
fviz_contrib(cust.pca, choice = "ind", axes = 1:3)
Insight :
The red line in the graph above indicates the expected average contribution of an individual customer.
The PCA results above for the variables (customers_num) and the individuals (customers) are summarized in the biplot (autoplot) from the ggfortify package for the first two dimensions.
library(ggfortify)
autoplot(cust.pca, loadings=TRUE, loadings.colour='darkorchid4', loadings.label=TRUE, loadings.label.size=3)
The goal of this project is to distinguish which customers fall into each segment or cluster. So far, Principal Component Analysis (PCA) has been conducted for all variables and individuals (customers) using prcomp(), eigenvalue extraction, squared cosines, and contributions.
Here is a comparison of the "Mall_Customers" segmentation as visualized from the K-means clustering and from the PCA.
kmeans <- fviz_cluster(object = customers_cluster,
data = customers_num) # the factor column (cluster) is not included
kmeans
autoplot(cust.pca, data = customers_clean, colour = "cluster")
From the unsupervised learning analysis above, we can summarize that:
K-means clustering works on this dataset. Even though there are only 3 variables, the goodness of fit (between_SS / total_SS) is about 82.6%, and K-means divides the customers into 8 reasonably distinct segments.
Dimensionality reduction adds little for this dataset because there are only 3 variables; reducing them to 2 principal components retains only about 77.6% of the information in the data.
The transformed data obtained from unsupervised learning (e.g. PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data.