knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
We will perform a clustering analysis using the K-means method. We will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
library(tidyverse)
library(lubridate)
library(cluster)
library(factoextra)
library(ggforce)
library(GGally)
library(scales)
library(cowplot)
library(FactoMineR)
library(plotly)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(carData)
library(car)
options(scipen = 100, max.print = 101)
customers <- read.csv("datainput/Mall_Customers.csv")
head(customers)
#> CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 1 Male 19 15 39
#> 2 2 Male 21 15 81
#> 3 3 Female 20 16 6
#> 4 4 Female 23 16 77
#> 5 5 Female 31 17 40
#> 6 6 Female 22 17 76
Column description:
CustomerID: unique customer identifier
Gender: customer gender
Age: customer age in years
Annual.Income..k..: annual income in thousands
Spending.Score..1.100.: spending score on a 1-100 scale
glimpse(customers)
#> Rows: 200
#> Columns: 5
#> $ CustomerID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
#> $ Gender <chr> "Male", "Male", "Female", "Female", "Female", "…
#> $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,…
#> $ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,…
#> $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15…
From the output above, the data has 200 rows and 5 columns, and we can see the data type of each column. Checking the data types is a crucial step because they must be appropriate for the analysis.
rownames(customers) <- customers$CustomerID # use CustomerID values as row names
head(customers)
#> CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 1 Male 19 15 39
#> 2 2 Male 21 15 81
#> 3 3 Female 20 16 6
#> 4 4 Female 23 16 77
#> 5 5 Female 31 17 40
#> 6 6 Female 22 17 76
customers_clean <- customers %>%
select(-CustomerID) %>% # unselect CustomerID
mutate(Gender = as.factor(Gender)) # change data type from chr to fct
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 Male 19 15 39
#> 2 Male 21 15 81
#> 3 Female 20 16 6
#> 4 Female 23 16 77
#> 5 Female 31 17 40
#> 6 Female 22 17 76
colSums(is.na(customers_clean))
#> Gender Age Annual.Income..k..
#> 0 0 0
#> Spending.Score..1.100.
#> 0
The dataset has no missing values in any column.
summary(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> Female:112 Min. :18.00 Min. : 15.00 Min. : 1.00
#> Male : 88 1st Qu.:28.75 1st Qu.: 41.50 1st Qu.:34.75
#> Median :36.00 Median : 61.50 Median :50.00
#> Mean :38.85 Mean : 60.56 Mean :50.20
#> 3rd Qu.:49.00 3rd Qu.: 78.00 3rd Qu.:73.00
#> Max. :70.00 Max. :137.00 Max. :99.00
p1 <- ggplot(customers_clean, aes(Gender, Age, fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Age")
p2 <- ggplot(customers_clean, aes(Gender, Annual.Income..k.. , fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Annual.Income ")
p3 <- ggplot(customers_clean, aes(Gender, Spending.Score..1.100. , fill = Gender)) + geom_boxplot(show.legend = F) +
theme_minimal() + labs(title = "Spending.Score")
plot_grid(p1,p2,p3)
Insight:
Based on the exploratory process, we could segment the customers by gender alone. To find more interesting, previously undiscovered patterns in the data, we will instead apply K-means clustering.
ggpairs(customers_clean[,c(2:4)], showStrips = F) +
theme(axis.text = element_text(colour = "black", size = 11),
strip.background = element_rect(fill = "#d63d2d"),
strip.text = element_text(colour = "white", size = 12,
face = "bold"))
Insight:
None of the numeric features are strongly correlated with each other, which indicates that the dataset has little multicollinearity and already satisfies the non-multicollinearity assumption of many classification algorithms.
Principal Component Analysis can still be performed on this data to produce uncorrelated components, while also reducing the dimension of the data and retaining as much information as possible. The result of this analysis can then be used for classification with lower computational cost.
customers_num <- customers_clean %>%
select_if(is.numeric)
glimpse(customers_num)
#> Rows: 200
#> Columns: 3
#> $ Age <int> 19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35,…
#> $ Annual.Income..k.. <int> 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 19,…
#> $ Spending.Score..1.100. <int> 39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99, 15…
# scaling
customers_scaled <- scale(customers_num)
# check data range
summary(customers_scaled)
#> Age Annual.Income..k.. Spending.Score..1.100.
#> Min. :-1.4926 Min. :-1.73465 Min. :-1.905240
#> 1st Qu.:-0.7230 1st Qu.:-0.72569 1st Qu.:-0.598292
#> Median :-0.2040 Median : 0.03579 Median :-0.007745
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.000000
#> 3rd Qu.: 0.7266 3rd Qu.: 0.66401 3rd Qu.: 0.882916
#> Max. : 2.2299 Max. : 2.91037 Max. : 1.889750
Check the plot of how much variance each PC explains when using scaled data:
# look at the variance captured by each PC
plot(prcomp(customers_scaled))
Note:
Scaling is needed when variables are measured on different scales. Here scale() standardizes each variable to mean 0 and standard deviation 1; it does not make the data normally distributed, it only puts the variables on a comparable scale.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the "bend of the elbow", where the total within-cluster sum of squares starts to flatten out as the number of clusters increases.
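As a sketch of what the elbow method computes, the total within-cluster sum of squares can also be calculated by hand for a range of k; the loop below is illustrative only (the seed, nstart = 25, and the range 1 to 10 are arbitrary choices, not part of the original analysis).
# manual elbow curve: total within-cluster sum of squares for k = 1..10 (illustrative sketch)
set.seed(100)
wss <- sapply(1:10, function(k) {
  kmeans(customers_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")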
library(factoextra)
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "wss"
)
Using the elbow method, 8 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares for higher numbers of clusters. This method alone may not be enough, because the location of the elbow is vague.
The silhouette method measures the silhouette coefficient by calculating the mean intra-cluster distance and the mean nearest-cluster distance for each observation. The optimal number of clusters is the one with the highest average silhouette score (the peak).
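As an illustration of what this score summarizes, the average silhouette width for a single candidate k can be computed directly with cluster::silhouette(); this is a sketch only, and km_tmp, the seed, and k = 8 are arbitrary illustrative choices.
# average silhouette width for one candidate k (illustrative sketch)
set.seed(100)
km_tmp <- kmeans(customers_scaled, centers = 8, nstart = 25)
sil <- silhouette(km_tmp$cluster, dist(customers_scaled))
mean(sil[, "sil_width"])  # average silhouette width for k = 8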
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "silhouette"
)
Based on the silhouette method, the number of clusters with the maximum score is considered the optimum k. The graph shows that the optimum number of clusters is 8.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
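The same statistic can also be computed directly with cluster::clusGap() before plotting it via fviz_nbclust(); the call below is a sketch, with B = 50 bootstrap reference sets and K.max = 10 chosen arbitrarily.
# gap statistic computed directly (illustrative sketch; B and K.max are arbitrary)
set.seed(100)
gap <- clusGap(customers_scaled, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
# fviz_gap_stat(gap)  # equivalent plot to fviz_nbclust(..., method = "gap_stat")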
fviz_nbclust(
x = customers_scaled,
FUNcluster = kmeans,
method = "gap_stat"
)
Based on the gap statistic method, the optimal k is 6.
Two out of three methods suggest that k = 8 is the optimum number of clusters, so we proceed with k = 8. The disagreement between the methods can also be a warning that clustering may not be very useful for this data.
Here is the idea behind K-means clustering: starting from k initial centroids, the algorithm repeatedly assigns each observation to its nearest centroid and then moves each centroid to the mean of the observations assigned to it, until the assignments stop changing.
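The function below is a minimal from-scratch sketch of that loop, for reference only; simple_kmeans is a hypothetical helper (the analysis itself uses stats::kmeans(), as in the next chunk), and it does not handle empty clusters or multiple random restarts.
# minimal K-means sketch (Lloyd's algorithm) -- illustrative only, not used in the analysis
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # assignment step: each observation goes to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_assignment <- max.col(-d)
    if (all(new_assignment == assignment)) break  # assignments stable -> converged
    assignment <- new_assignment
    # update step: move each centroid to the mean of its assigned observations
    centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centers)
}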
RNGkind(sample.kind = "Rounding")
set.seed(100)
customers_cluster <- kmeans(customers_scaled, centers = 8) # number of centroids (clusters) we want
customers_cluster
#> K-means clustering with 8 clusters of sizes 31, 20, 20, 21, 10, 27, 39, 32
#>
#> Cluster means:
#> Age Annual.Income..k.. Spending.Score..1.100.
#> 1 -1.0284288 -0.3148588 0.03597620
#> 2 0.4688952 -1.3291594 -1.22562679
#> 3 1.8612633 -0.2821275 -0.01355353
#> 4 -0.9676183 -1.3502813 1.15583070
#> 5 -0.8912587 0.5764386 -1.32824641
#> 6 0.4747282 1.1885821 -1.20676658
#> 7 -0.4408110 0.9891010 1.23640011
#> 8 0.5901457 -0.1902742 -0.09245447
#>
#> Clustering vector:
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
#> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#> 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4
#> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 3 4 8 1 2 1 8 1 1 1 8 1 1 3 8 8 8 3 1 8
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 3 1 3 8 3 1 8 3 1 1 3 8 3 3 3 1 8 8 1 8
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#> 3 8 3 8 1 8 8 1 1 8 3 1 8 8 1 1 8 1 8 1
#> 101
#> 1
#> [ reached getOption("max.print") -- omitted 99 entries ]
#>
#> Within cluster sum of squares by cluster:
#> [1] 12.116105 19.809395 4.461166 7.591476 4.952009 23.114703 22.362673
#> [8] 9.645699
#> (between_SS / total_SS = 82.6 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
K-means results:
customers_cluster$size
#> [1] 31 20 20 21 10 27 39 32
Note: cluster 1 has 31 observations, cluster 2 has 20, cluster 3 has 20, cluster 4 has 21, cluster 5 has 10, cluster 6 has 27, cluster 7 has 39, and cluster 8 has 32.
# location of each cluster center (centroid)
customers_cluster$centers
#> Age Annual.Income..k.. Spending.Score..1.100.
#> 1 -1.0284288 -0.3148588 0.03597620
#> 2 0.4688952 -1.3291594 -1.22562679
#> 3 1.8612633 -0.2821275 -0.01355353
#> 4 -0.9676183 -1.3502813 1.15583070
#> 5 -0.8912587 0.5764386 -1.32824641
#> 6 0.4747282 1.1885821 -1.20676658
#> 7 -0.4408110 0.9891010 1.23640011
#> 8 0.5901457 -0.1902742 -0.09245447
The output above shows the centroid of each cluster: the average (scaled) value of each variable within that cluster.
label <- as.data.frame(customers_cluster$cluster)
head(label)
#> customers_cluster$cluster
#> 1 1
#> 2 4
#> 3 2
#> 4 4
#> 5 2
#> 6 4
The output above shows which cluster each customer (row) is assigned to.
customers_cluster$iter
#> [1] 3
The output above shows that the algorithm needed 3 iterations to reach stable clusters.
The quality of the clustering can be assessed from three values:
$withinss: the sum of squared distances from each observation to the centroid of its cluster.
$betweenss: the weighted sum of squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
$totss: the sum of squared distances from each observation to the global mean.
customers_cluster$withinss
#> [1] 12.116105 19.809395 4.461166 7.591476 4.952009 23.114703 22.362673
#> [8] 9.645699
customers_cluster$tot.withinss
#> [1] 104.0532
bss <- customers_cluster$betweenss
bss
#> [1] 492.9468
tss <- customers_cluster$totss
bss/tss
#> [1] 0.8257065
Insight: The clustering is quite good because the BSS/TSS ratio of about 0.83 is close to 1.
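As a sanity check on these numbers, the three sums of squares can be recomputed from the scaled data and the fitted cluster labels; this is a sketch, and tss_manual / wss_manual are illustrative names.
# recompute the sums of squares by hand and compare with the kmeans output
global_mean <- colMeans(customers_scaled)
tss_manual <- sum(sweep(customers_scaled, 2, global_mean)^2)  # total sum of squares
wss_manual <- sum(sapply(1:8, function(i) {
  members <- customers_scaled[customers_cluster$cluster == i, , drop = FALSE]
  sum(sweep(members, 2, colMeans(members))^2)  # within-cluster sum of squares
}))
c(tot.withinss = wss_manual, betweenss = tss_manual - wss_manual, totss = tss_manual)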
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100.
#> 1 Male 19 15 39
#> 2 Male 21 15 81
#> 3 Female 20 16 6
#> 4 Female 23 16 77
#> 5 Female 31 17 40
#> 6 Female 22 17 76
# 2-dimensional cluster visualization
fviz_cluster(object = customers_cluster,
data = customers_num)
Create a new column containing the cluster label assigned to each customer by the clustering with the chosen k.
customers_clean["cluster"] <- as.factor(customers_cluster$cluster)
# check the head of the data
head(customers_clean)
#> Gender Age Annual.Income..k.. Spending.Score..1.100. cluster
#> 1 Male 19 15 39 1
#> 2 Male 21 15 81 4
#> 3 Female 20 16 6 2
#> 4 Female 23 16 77 4
#> 5 Female 31 17 40 2
#> 6 Female 22 17 76 4
# cluster profiling
library(dplyr)
customers_clean %>%
group_by(cluster) %>%
summarise_all(.funs = "mean")
#> # A tibble: 8 × 5
#> cluster Gender Age Annual.Income..k.. Spending.Score..1.100.
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 24.5 52.3 51.1
#> 2 2 NA 45.4 25.6 18.6
#> 3 3 NA 64.8 53.2 49.8
#> 4 4 NA 25.3 25.1 80.0
#> 5 5 NA 26.4 75.7 15.9
#> 6 6 NA 45.5 91.8 19.0
#> 7 7 NA 32.7 86.5 82.1
#> 8 8 NA 47.1 55.6 47.8
Note: Gender is a factor, so summarise_all() with mean returns NA for it; it could be excluded beforehand with select(-Gender).
From the profiling result, we can derive insights:
Cluster 1: Consists of customers characterized by: predominantly Female gender, average age in the mid-20s, medium annual income, and medium spending score.
Cluster 2: Consists of customers characterized by: predominantly Female gender, average age of 40-50 years, low annual income, and low spending score.
Cluster 3: Consists of customers characterized by: predominantly Male gender, average age over 60 years, medium annual income, and medium spending score.
Cluster 4: Consists of customers characterized by: predominantly Female gender, average age under 30 years, low annual income, and high spending score.
Cluster 5: Consists of customers characterized by: balanced gender distribution, average age in the mid-20s, high annual income, and low spending score.
Cluster 6: Consists of customers characterized by: balanced gender distribution, average age over 40 years, high annual income, and low spending score.
Cluster 7: Consists of customers characterized by: balanced gender distribution, average age in the early 30s, high annual income, and high spending score.
Cluster 8: Consists of customers characterized by: balanced gender distribution, average age over 40 years, medium annual income, and medium spending score.
The first step of this analysis performs PCA with the prcomp() function. With scale = TRUE, the data is centered at 0 and each variable is scaled to unit variance, i.e. standardized, which is necessary because the variables are measured on different scales. Additionally, eigenvalues are extracted with the get_eigenvalue() function. Eigenvalues measure the amount of variation captured by each principal component (PC) and are evaluated to determine how many principal components should be kept.
cust.pca <- prcomp(customers_num, scale = TRUE)
print(cust.pca)
#> Standard deviations (1, .., p=3):
#> [1] 1.1523823 0.9996256 0.8202217
#>
#> Rotation (n x k) = (3 x 3):
#> PC1 PC2 PC3
#> Age 0.70638235 0.03014116 -0.707188441
#> Annual.Income..k.. -0.04802398 0.99883160 -0.005397916
#> Spending.Score..1.100. -0.70619946 -0.03777499 -0.707004506
summary(cust.pca)
#> Importance of components:
#> PC1 PC2 PC3
#> Standard deviation 1.1524 0.9996 0.8202
#> Proportion of Variance 0.4427 0.3331 0.2243
#> Cumulative Proportion 0.4427 0.7758 1.0000
eig.val<-get_eigenvalue(cust.pca)
eig.val
#> eigenvalue variance.percent cumulative.variance.percent
#> Dim.1 1.3279850 44.26617 44.26617
#> Dim.2 0.9992514 33.30838 77.57455
#> Dim.3 0.6727636 22.42545 100.00000
fviz_eig(cust.pca, col.var="blue")
Based on the importance of the components, the first two PCs have the highest proportions of variance. This is also supported by the eigenvalues: the eigenvalue is largest for the first PC and decreases for the subsequent PCs, indicating that the first PC corresponds to the direction of maximum variation in the dataset. The eigenvalues sum to 3, the number of standardized variables. The first component explains 44.27% of the variance and the second explains 33.31%, so together the first two components explain 77.57% of the variance, which is a reasonable basis for further analysis.
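The eigenvalues reported by get_eigenvalue() are simply the squared standard deviations of the components returned by prcomp(), and the variance percentages follow from dividing each eigenvalue by their sum; a quick check (a sketch, assuming the cust.pca object above):
# eigenvalues are the squared component standard deviations
cust.pca$sdev^2
# proportion and cumulative proportion of variance explained
cust.pca$sdev^2 / sum(cust.pca$sdev^2)
cumsum(cust.pca$sdev^2 / sum(cust.pca$sdev^2))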
The results of PCA can be evaluated for the variables (customers_num) and for the individuals (customers). First, the results for the variables are extracted. For this purpose, get_pca_var() is used; it returns a list of matrices containing all results for the active variables (coordinates, correlations between variables and axes, squared cosines, and contributions).
var <- get_pca_var(cust.pca)
var
#> Principal Component Analysis Results for variables
#> ===================================================
#> Name Description
#> 1 "$coord" "Coordinates for the variables"
#> 2 "$cor" "Correlations between variables and dimensions"
#> 3 "$cos2" "Cos2 for the variables"
#> 4 "$contrib" "contributions of the variables"
Cos2 stands for squared cosine (squared coordinate) and reflects the quality of representation of a variable. The cos2 of the variables on all dimensions is displayed below with the corrplot package, along with a bar plot of the variable cos2 using the fviz_cos2() function.
head(var$cos2)
#> Dim.1 Dim.2 Dim.3
#> Age 0.662632678 0.0009078095 0.33645951286
#> Annual.Income..k.. 0.003062735 0.9969176625 0.00001960264
#> Spending.Score..1.100. 0.662289604 0.0014258818 0.33628451440
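Because the variable coordinates of a standardized PCA are correlations with the components, cos2 is just the squared coordinate, and the cos2 values of each variable sum to 1 across all retained components; a quick check (a sketch, assuming the var object above):
# cos2 equals the squared variable coordinate ...
all.equal(var$cos2, var$coord^2, check.attributes = FALSE)
# ... and each variable's cos2 sums to 1 over the three components
rowSums(var$cos2)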
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
fviz_cos2(cust.pca, choice = "var", axes = 1:3)
Note:
With only 3 numeric variables in this dataset, all of them are considered to make significant contributions.
Additionally, the quality of variable representation can be displayed on a factor map, where cos2 values vary by gradient color. Variables with low cos2 values will be colored “darkorchid4”, moderate cos2 values - “gold”, high cos2 values - “red”. Positively correlated variables are grouped together, while negatively correlated variables are placed on the opposite side of the plot origin. The distance between variables and the origin measures the quality of variables on the factor map. Variables that are far from the origin are well-represented on the factor map.
fviz_pca_var(cust.pca,
col.var = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "red"),
repel = TRUE
)
Insight :
Annual.Income has a very high cos2 value, indicating good representation of this variable on the principal components; it is positioned close to the circumference of the correlation circle. Spending.Score and Age have the lowest cos2 values, indicating that these variables are not fully represented by the first two components.
"Contrib" refers to the contribution of variables. The fviz_contrib() function is used to draw bar plots of the variable contributions to the most significant dimensions, namely PC1, PC2, and PC3.
# Contributions of variables to PC1
a<-fviz_contrib(cust.pca, choice = "var", axes = 1)
# Contributions of variables to PC2
b<-fviz_contrib(cust.pca, choice = "var", axes = 2)
# Contributions of variables to PC3
c<-fviz_contrib(cust.pca, choice = "var", axes = 3)
grid.arrange(a, b, c, ncol = 2, top = 'Contribution of the variables to the first three PCs')
Insight :
The red line in each plot above indicates the expected average contribution. For a given component, variables with contributions exceeding this threshold are considered important contributors to that component. It can be observed that all three variables contribute to at least one component.
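For reference, the contribution of a variable to a component is its cos2 expressed as a percentage of that component's total cos2, so with three variables the expected average contribution (the red line) is 100/3 ≈ 33.3%; the check below is a sketch, and contrib_manual is an illustrative name.
# contribution (%) = 100 * cos2 / column total of cos2 for that component
contrib_manual <- sweep(var$cos2, 2, colSums(var$cos2), "/") * 100
all.equal(contrib_manual, var$contrib, check.attributes = FALSE)
100 / nrow(var$cos2)  # expected average contribution with 3 variables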
The results, for individuals (customers), will be extracted using the get_pca_ind() function. Similar to variables, this provides a matrix list containing all results for individuals (coordinates, correlations between individuals and axes, squared cosines, and contributions). For individuals, the analysis will focus on the cos2 and contributions of individuals to all principal components (PC1, PC2, and PC3).
ind <- get_pca_ind(cust.pca)
ind
#> Principal Component Analysis Results for individuals
#> ===================================================
#> Name Description
#> 1 "$coord" "Coordinates for the individuals"
#> 2 "$cos2" "Cos2 for the individuals"
#> 3 "$contrib" "contributions of the individuals"
fviz_pca_ind(cust.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "red"),
repel = TRUE
)
Insight :
Customers colored red have very high cos2 values, meaning they are well represented on the principal components and lie close to the correlation circle. Customers colored darkorchid4 have the lowest cos2 values, meaning they are not well represented by the principal components and lie close to the center of the circle.
# Total contribution on PC1 and PC2
fviz_contrib(cust.pca, choice = "ind", axes = 1:3)
Insight :
The red line in the graph above indicates the expected average contribution of an individual customer.
The PCA results above for the variables (customers_num) and the individuals (customers) are summarized in the biplot (autoplot) from the ggfortify package for the first two dimensions.
library(ggfortify)
autoplot(cust.pca, loadings=TRUE, loadings.colour='darkorchid4', loadings.label=TRUE, loadings.label.size=3)
The goal of this project is to distinguish which customers fall into each segment or cluster. So far, Principal Component Analysis (PCA) has been conducted for all variables and individuals (customers) using prcomp(), eigenvalue extraction, squared cosines, and contributions.
Here is a comparison of the "Mall_Customers" segmentation as visualized from the K-means clustering and from the PCA.
kmeans <- fviz_cluster(object = customers_cluster,
data = customers_num) # the factor column (cluster) is not included
kmeans
autoplot(cust.pca, data = customers_clean, colour = "cluster")
From the unsupervised learning analysis above, we can summarize that:
K-means clustering works on this dataset. Even though there are only 3 variables, the goodness of fit (between_SS / total_SS) is about 82.6%, and K-means divides the customers into 8 reasonably distinct segments.
Dimensionality reduction adds little for this dataset because there are only 3 variables; reducing them to 2 principal components retains only about 77.6% of the information in the data.
The transformed data obtained from unsupervised learning (e.g. PCA) can be used further for supervised learning (classification) or for better visualization of high-dimensional data.