library(tidyverse)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(ggplot2)
library(DALEX)
library(ggpubr)
library(fastDummies)
library(qdapTools)
library(gridExtra)
library(caret)
library(clValid)
library(kableExtra)
library(corrplot)
This dataset was obtained from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand
The dataset consists of posts extracted from the Facebook pages of 10 Thai fashion and cosmetics retail sellers from March 2012 to June 2018. For each Facebook post, the dataset records the resulting engagement metrics: shares, comments, and emoji reactions, within which traditional “likes” are distinguished from the more recently introduced emoji reactions “love”, “wow”, “haha”, “sad” and “angry” (Dehouche 2020).
fb <- read_csv("Data/Live_20210128.csv")
## Rows: 7050 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): status_type, status_published
## dbl (10): status_id, num_reactions, num_comments, num_shares, num_likes, num...
## lgl (4): Column1, Column2, Column3, Column4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fb
summary(fb)
## status_id status_type status_published num_reactions
## Min. : 1 Length:7050 Length:7050 Min. : 0.0
## 1st Qu.:1763 Class :character Class :character 1st Qu.: 17.0
## Median :3526 Mode :character Mode :character Median : 59.5
## Mean :3526 Mean : 230.1
## 3rd Qu.:5288 3rd Qu.: 219.0
## Max. :7050 Max. :4710.0
## num_comments num_shares num_likes num_loves
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 17.0 1st Qu.: 0.00
## Median : 4.0 Median : 0.00 Median : 58.0 Median : 0.00
## Mean : 224.4 Mean : 40.02 Mean : 215.0 Mean : 12.73
## 3rd Qu.: 23.0 3rd Qu.: 4.00 3rd Qu.: 184.8 3rd Qu.: 3.00
## Max. :20990.0 Max. :3424.00 Max. :4710.0 Max. :657.00
## num_wows num_hahas num_sads num_angrys
## Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean : 1.289 Mean : 0.6965 Mean : 0.2437 Mean : 0.1132
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :278.000 Max. :157.0000 Max. :51.0000 Max. :31.0000
## Column1 Column2 Column3 Column4
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:7050 NA's:7050 NA's:7050 NA's:7050
##
##
##
##
We have Column1, Column2, Column3 and Column4, which contain only missing values, so we will drop them. We can also convert the status_type column, which is a character variable, into dummy variables.
fb <- fb %>% fastDummies::dummy_cols("status_type", remove_first_dummy = F) %>%
dplyr::select(-c(status_type, status_id, Column1:Column4)) %>% # Drop unnecessary columns
drop_na()
We see that the column status_published holds a date-time (currently stored as character), so let’s convert it to the time elapsed since posting, reflecting how long each post has existed.
fb <- fb %>%
mutate(status_published = as.POSIXct(status_published,
format = "%m/%d/%Y %H:%M"),
status_published = as.numeric(difftime(Sys.time(), status_published, units = "days"))) %>% # Convert the publication timestamp to days elapsed since publication
drop_na()
fb
cor_mat <- cor(fb)
corrplot(cor_mat, type = "lower", order = "hclust",
tl.col = "black", tl.cex = 0.5)
Since the dataset is fairly large and rendering the full dataset could take a long time, I will use only 10% of the data, which corresponds to approximately 700 observations. Scaling the data is also recommended before clustering so that all variables are weighted equally. Clustering is a distance-based method: if the variables are on different scales, the results may be biased towards variables with larger values, so scaling makes the contribution of each variable comparable and helps keep the clustering results accurate, robust, and interpretable.
set.seed(24)
selection <- createDataPartition(fb$num_reactions, p=0.10, list=FALSE)
fb_scaled <- center_scale(fb[selection,])
colnames(fb_scaled) <- colnames(fb)
get_clust_tendency(fb_scaled, 2, graph = T, gradient=list(low="red", high="green"))
## $hopkins_stat
## [1] 0.9986341
##
## $plot
The Hopkins statistic assesses the clustering tendency of a dataset. It compares nearest-neighbour distances in the actual data with those of uniformly generated points and returns a value between 0 and 1. Values around 0.5 suggest the data are essentially uniformly (randomly) distributed and therefore not well suited for clustering, while values close to 1 suggest the data are highly clustered and well suited for clustering.
In this case the Hopkins statistic is approximately 0.999, far above 0.5, which indicates that the dataset has a strong clustering tendency and is suitable for cluster analysis.
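To make the definition concrete, below is a minimal hand-rolled sketch of the statistic; the helper name hopkins_manual and the m = 100 sampled points are illustrative assumptions, and get_clust_tendency() above remains the authoritative computation.
# Minimal sketch of the Hopkins statistic (hopkins_manual and m = 100 are
# illustrative assumptions, not part of the original analysis)
hopkins_manual <- function(X, m = 100) {
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)
  # artificial points drawn uniformly within the observed range of each variable
  unif <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  # u: nearest-neighbour distance from each artificial point to the real data
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # w: nearest-neighbour distance from m sampled real points to the remaining data
  idx <- sample(n, m)
  w <- sapply(idx, function(i) min(sqrt(colSums((t(X[-i, ]) - X[i, ])^2))))
  # clustered data -> w small relative to u -> statistic close to 1
  sum(u) / (sum(u) + sum(w))
}
set.seed(24)
hopkins_manual(fb_scaled)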
The next step is to determine the optimal number of clusters for each partitional clustering method. We will use three methods: K-means, PAM, and CLARA, the last of which is designed for large datasets. The optimal number of clusters will be chosen primarily based on the silhouette statistic.
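As a reference for what the silhouette criterion measures, the average silhouette width can also be computed directly with cluster::silhouette; the short sketch below does this for K-means (nstart = 25 and the 2:15 range are illustrative choices, mirroring the k.max used next), and fviz_nbclust automates this loop for each method and plots the result.
# Minimal sketch of the silhouette criterion that fviz_nbclust automates below
# (nstart = 25 and the 2:15 range are illustrative assumptions)
avg_sil <- function(k) {
  km <- kmeans(fb_scaled, centers = k, nstart = 25)
  sil <- cluster::silhouette(km$cluster, dist(fb_scaled))
  mean(sil[, "sil_width"]) # average silhouette width for this k
}
sapply(2:15, avg_sil) # larger values indicate better-separated clusters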
n_k <- 15 # Number of Clusters to consider
f1 <- fviz_nbclust(fb_scaled, FUNcluster = kmeans, method = "silhouette", k.max = n_k) +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(fb_scaled, FUNcluster = cluster::pam, method = "silhouette", k.max = n_k) +
ggtitle("Optimal number of clusters \n PAM")
f3 <- fviz_nbclust(fb_scaled, FUNcluster = cluster::clara, method = "silhouette", k.max = n_k) +
ggtitle("Optimal number of clusters \n CLARA")
grid.arrange(f1, f2, f3, ncol=3)
The results show that for K-means and PAM, dividing the data into 9 clusters is the most appropriate, while CLARA favours 12.
To confirm these results, it is always good to look at an alternative criterion, so I also check the stability of the results above using the WSS (elbow) statistic.
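For reference, the WSS (elbow) criterion is simply the total within-cluster sum of squares at each candidate k; a minimal K-means sketch (nstart = 25 is an illustrative assumption) is shown below, and fviz_nbclust automates the same computation for all three methods.
# Minimal sketch of the WSS (elbow) criterion (nstart = 25 is an illustrative assumption)
wss <- sapply(1:n_k, function(k) kmeans(fb_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:n_k, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")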
f1 <- fviz_nbclust(fb_scaled, FUNcluster = kmeans, method = "wss", k.max = n_k) +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(fb_scaled, FUNcluster = cluster::pam, method = "wss", k.max = n_k) +
ggtitle("Optimal number of clusters \n PAM")
f3 <- fviz_nbclust(fb_scaled, FUNcluster = cluster::clara, method = "wss", k.max = n_k) +
ggtitle("Optimal number of clusters \n CLARA")
grid.arrange(f1, f2, f3, ncol=3)
For K-means it would also be worth trying a smaller number of clusters.
km9 <- eclust(fb_scaled, k=9 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c9 <- fviz_cluster(km9, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 9 clusters")
s9 <- fviz_silhouette(km9)
## cluster size ave.sil.width
## 1 1 196 0.38
## 2 2 30 0.55
## 3 3 25 0.00
## 4 4 272 0.74
## 5 5 34 0.39
## 6 6 4 -0.04
## 7 7 10 0.87
## 8 8 42 -0.10
## 9 9 93 0.88
grid.arrange(c9, s9, ncol=2)
km7 <- eclust(fb_scaled, k=7 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)
c7 <- fviz_cluster(km7, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 7 clusters")
s7 <- fviz_silhouette(km7)
## cluster size ave.sil.width
## 1 1 213 0.30
## 2 2 31 0.57
## 3 3 12 -0.04
## 4 4 312 0.68
## 5 5 93 0.88
## 6 6 35 0.00
## 7 7 10 0.87
grid.arrange(c7, s7, ncol=2)
pam9 <- eclust(fb_scaled, k=9 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp9 <- fviz_cluster(pam9, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 9 clusters")
sp9 <- fviz_silhouette(pam9)
## cluster size ave.sil.width
## 1 1 310 0.66
## 2 2 111 0.60
## 3 3 53 0.26
## 4 4 10 0.87
## 5 5 30 0.55
## 6 6 30 -0.27
## 7 7 35 0.47
## 8 8 33 0.38
## 9 9 94 0.87
grid.arrange(cp9, sp9, ncol=2)
pam7 <- eclust(fb_scaled, k=7 , FUNcluster="pam", hc_metric="euclidean", graph=F)
cp7 <- fviz_cluster(pam7, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 7 clusters")
sp7 <- fviz_silhouette(pam7)
## cluster size ave.sil.width
## 1 1 309 0.69
## 2 2 140 0.57
## 3 3 85 -0.19
## 4 4 10 0.87
## 5 5 30 0.55
## 6 6 38 0.37
## 7 7 94 0.87
grid.arrange(cp7, sp7, ncol=2)
cL12 <- eclust(fb_scaled, k=12 , FUNcluster="clara", hc_metric="euclidean", graph=F)
cCl12 <- fviz_cluster(cL12, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("Clara with 12 clusters")
sCl12 <- fviz_silhouette(cL12)
## cluster size ave.sil.width
## 1 1 91 -0.14
## 2 2 106 0.69
## 3 3 222 0.76
## 4 4 31 -0.23
## 5 5 54 0.22
## 6 6 10 0.87
## 7 7 30 0.54
## 8 8 35 0.45
## 9 9 30 0.43
## 10 10 93 0.88
## 11 11 3 0.40
## 12 12 1 0.00
grid.arrange(cCl12, sCl12, ncol=2)
cL8 <- eclust(fb_scaled, k=8 , FUNcluster="clara", hc_metric="euclidean", graph=F)
cCl8 <- fviz_cluster(cL8, data=fb_scaled, ellipse.type="convex", geom=c("point")) + ggtitle("Clara with 8 clusters")
sCl8 <- fviz_silhouette(cL8)
## cluster size ave.sil.width
## 1 1 309 0.69
## 2 2 108 0.69
## 3 3 87 -0.22
## 4 4 10 0.87
## 5 5 30 0.55
## 6 6 35 0.45
## 7 7 33 0.38
## 8 8 94 0.87
grid.arrange(cCl8, sCl8, ncol=2)
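Finally, to compare the three methods and the candidate numbers of clusters more formally, I run clValid's stability validation, which reports the APN, AD, ADM and FOM measures described below.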
clmethods <- c("kmeans","pam", "clara")
st <- clValid(fb_scaled, nClust=c(7:12),
clMethods=clmethods, maxitems = 1000,
validation="stability", method="complete")
## Warning in clValid(fb_scaled, nClust = c(7:12), clMethods = clmethods, maxitems
## = 1000, : rownames for data not specified, using 1:nrow(data)
optimalScores(st)
APN: Average Proportion of Non-overlap. This measures the average proportion of observations that are not placed in the same cluster when clustering is based on the full data versus the data with one column removed. It takes values between 0 and 1, with values close to 0 indicating highly consistent (stable) clustering.
AD: Average Distance. This measures the average distance between observations placed in the same cluster under both cases (full data and with one column removed). Smaller values indicate more stable clustering.
ADM: Average Distance between Means. This measures the average distance between cluster centers for observations placed in the same cluster under both cases. Smaller values again indicate more stable clustering.
FOM: Figure of Merit. This measures the average intra-cluster variance of the removed column, where clustering is based on the remaining columns. It reflects how well the clustering predicts the left-out variable, and smaller values indicate better performance.
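optimalScores() reports only the best value of each measure; the full stability results behind it can be inspected with clValid's summary and plot methods.
summary(st) # all four stability measures for every method and number of clusters
plot(st) # one curve per method, showing each measure across the number of clusters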
Based on these four stability measures, PAM was assessed to be the most consistent of the methods used for this data. There is no consensus on the number of clusters; however, 7 and 12 appear to be the most attractive choices.
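As a possible follow-up (not part of the original analysis), the agreement between the two leading candidate solutions fitted above, pam7 (k = 7) and cL12 (k = 12), can be checked with flexclust's adjusted Rand index, which equals 1 for identical partitions and is close to 0 when agreement is no better than chance.
# Hedged follow-up: agreement between the k = 7 and k = 12 solutions fitted above
table(pam7$clustering, cL12$clustering) # cross-tabulation of cluster memberships
flexclust::randIndex(table(pam7$clustering, cL12$clustering)) # adjusted Rand index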