Clustering is an unsupervised learning algorithm which groups data points into clusters. The set of data points in a group should have similar properties and/or features. There are several types of clustering methods. In this project I will be using K-means clustering which is also one of the most popular clustering algorithm. We identify K clusters of n observations that are grouped to their nearest centroid and each cluster will have a centroid containing the data points or vectors closest to it making centroid the center of each cluster.
More detailed theoretical introduction on K-Means clustering can be found here, https://medium.com/0xcode/the-k-means-clustering-algorithm-intuition-demonstrated-in-r-aa62584a3649.
In this project, I will be using a dataset from UCI Machine Learning Repository. The dataset contains Facebook pages of 10 Thai fashion and cosmetics retail sellers. Posts of a different nature (video, photos, statuses, and links). Engagement metrics consist of comments, shares, and reactions. The dataset has attributes such as status_id, status_type, status_published, num_reactions, num_comments, num_shares, num_likes, num_loves, num_wows, num_hahas, num_sads and num_angrys. The link to data set is https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand.
# Loading libraries and setting directory
set.seed(100)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
## Warning: package 'cluster' was built under R version 4.0.4
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(purrr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.5 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.1
## v readr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(corrplot)
## corrplot 0.84 loaded
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
library(ClusterR)
## Loading required package: gtools
library(stats)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
#Loading the dataset
data <- read.csv("Live_20210128.csv", stringsAsFactors = FALSE, header = TRUE)
head(data)
## status_id status_type status_published num_reactions num_comments num_shares
## 1 1 video 4/22/2018 6:00 529 512 262
## 2 2 photo 4/21/2018 22:45 150 0 0
## 3 3 video 4/21/2018 6:17 227 236 57
## 4 4 photo 4/21/2018 2:29 111 0 0
## 5 5 photo 4/18/2018 3:22 213 0 0
## 6 6 photo 4/18/2018 2:14 217 6 0
## num_likes num_loves num_wows num_hahas num_sads num_angrys Column1 Column2
## 1 432 92 3 1 1 0 NA NA
## 2 150 0 0 0 0 0 NA NA
## 3 204 21 1 1 0 0 NA NA
## 4 111 0 0 0 0 0 NA NA
## 5 204 9 0 0 0 0 NA NA
## 6 211 5 1 0 0 0 NA NA
## Column3 Column4
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
#checking duplicates
data[duplicated(data),]
## [1] status_id status_type status_published num_reactions
## [5] num_comments num_shares num_likes num_loves
## [9] num_wows num_hahas num_sads num_angrys
## [13] Column1 Column2 Column3 Column4
## <0 rows> (or 0-length row.names)
# Summaary of dataset
summary(data)
## status_id status_type status_published num_reactions
## Min. : 1 Length:7050 Length:7050 Min. : 0.0
## 1st Qu.:1763 Class :character Class :character 1st Qu.: 17.0
## Median :3526 Mode :character Mode :character Median : 59.5
## Mean :3526 Mean : 230.1
## 3rd Qu.:5288 3rd Qu.: 219.0
## Max. :7050 Max. :4710.0
## num_comments num_shares num_likes num_loves
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 17.0 1st Qu.: 0.00
## Median : 4.0 Median : 0.00 Median : 58.0 Median : 0.00
## Mean : 224.4 Mean : 40.02 Mean : 215.0 Mean : 12.73
## 3rd Qu.: 23.0 3rd Qu.: 4.00 3rd Qu.: 184.8 3rd Qu.: 3.00
## Max. :20990.0 Max. :3424.00 Max. :4710.0 Max. :657.00
## num_wows num_hahas num_sads num_angrys
## Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean : 1.289 Mean : 0.6965 Mean : 0.2437 Mean : 0.1132
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :278.000 Max. :157.0000 Max. :51.0000 Max. :31.0000
## Column1 Column2 Column3 Column4
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:7050 NA's:7050 NA's:7050 NA's:7050
##
##
##
##
#Checking for NAs
head(data[,colSums(is.na(data)) > 0])
## Column1 Column2 Column3 Column4
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
#remove last 4 columns
data <- data[,1:12]
data$status_published <- format(data$status_published, format = "%m/%d/%Y %H:%M")
#remove time and keeping date
data$status_published <- as.POSIXct(data$status_published, format = "%m/%d/%Y")
head(data)
## status_id status_type status_published num_reactions num_comments num_shares
## 1 1 video 2018-04-22 529 512 262
## 2 2 photo 2018-04-21 150 0 0
## 3 3 video 2018-04-21 227 236 57
## 4 4 photo 2018-04-21 111 0 0
## 5 5 photo 2018-04-18 213 0 0
## 6 6 photo 2018-04-18 217 6 0
## num_likes num_loves num_wows num_hahas num_sads num_angrys
## 1 432 92 3 1 1 0
## 2 150 0 0 0 0 0
## 3 204 21 1 1 0 0
## 4 111 0 0 0 0 0
## 5 204 9 0 0 0 0
## 6 211 5 1 0 0 0
#Checking for NAs after processing
data[rowSums(is.na(data)) > 0,]
## [1] status_id status_type status_published num_reactions
## [5] num_comments num_shares num_likes num_loves
## [9] num_wows num_hahas num_sads num_angrys
## <0 rows> (or 0-length row.names)
#Number of distinct rows
nrow(distinct(data))
## [1] 7050
# Unique status_type, we have 4 types of status - video, photo, link and status.
unique(data$status_type)
## [1] "video" "photo" "link" "status"
# Checking the frequency of status update per day, it doesn't give any significant insight
x <- as.data.frame(plyr::count(data, "status_published"))
x <- x %>% arrange(desc(freq))
head(x)
## status_published freq
## 1 2018-06-07 51
## 2 2018-06-09 43
## 3 2018-06-11 42
## 4 2018-05-25 38
## 5 2018-06-08 35
## 6 2018-05-23 33
# Analyszing internal structure of dataset
str(data)
## 'data.frame': 7050 obs. of 12 variables:
## $ status_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ status_type : chr "video" "photo" "video" "photo" ...
## $ status_published: POSIXct, format: "2018-04-22" "2018-04-21" ...
## $ num_reactions : int 529 150 227 111 213 217 503 295 203 170 ...
## $ num_comments : int 512 0 236 0 0 6 614 453 1 9 ...
## $ num_shares : int 262 0 57 0 0 0 72 53 0 1 ...
## $ num_likes : int 432 150 204 111 204 211 418 260 198 167 ...
## $ num_loves : int 92 0 21 0 9 5 70 32 5 3 ...
## $ num_wows : int 3 0 1 0 0 1 10 1 0 0 ...
## $ num_hahas : int 1 0 1 0 0 0 2 1 0 0 ...
## $ num_sads : int 1 0 0 0 0 0 0 0 0 0 ...
## $ num_angrys : int 0 0 0 0 0 0 3 1 0 0 ...
# We don't need status_id, status_published date for building clusters so we drop them
data <- data[,c(2,4:12)]
head(data)
## status_type num_reactions num_comments num_shares num_likes num_loves
## 1 video 529 512 262 432 92
## 2 photo 150 0 0 150 0
## 3 video 227 236 57 204 21
## 4 photo 111 0 0 111 0
## 5 photo 213 0 0 204 9
## 6 photo 217 6 0 211 5
## num_wows num_hahas num_sads num_angrys
## 1 3 1 1 0
## 2 0 0 0 0
## 3 1 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 1 0 0 0
df <- data[,2:ncol(data)]
head(df)
## num_reactions num_comments num_shares num_likes num_loves num_wows num_hahas
## 1 529 512 262 432 92 3 1
## 2 150 0 0 150 0 0 0
## 3 227 236 57 204 21 1 1
## 4 111 0 0 111 0 0 0
## 5 213 0 0 204 9 0 0
## 6 217 6 0 211 5 1 0
## num_sads num_angrys
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
# we check the assumption that num_reactions = num_likes + num_loves + num_wows + num_hahas + num_sads + num_angrys
count <- rep(NA,nrow(df))
l <- rep(0,nrow(df))
for (i in 1:nrow(df))
{
if(df[i,"num_reactions"] == df[i,"num_likes"] + df[i,"num_loves"] + df[i,"num_wows"] + df[i,"num_hahas"] + df[i,"num_sads"] + df[i,"num_angrys"])
{
count[i] = TRUE
}
else
{
count[i] = FALSE
l[i] = df[i,"num_reactions"] - (df[i,"num_likes"] + df[i,"num_loves"] + df[i,"num_wows"] + df[i,"num_hahas"] + df[i,"num_sads"] + df[i,"num_angrys"])
}
}
df[count==FALSE,]
## num_reactions num_comments num_shares num_likes num_loves num_wows
## 239 885 462 26 659 220 0
## 248 264 2 0 256 2 5
## 249 313 3 0 297 7 6
## 252 247 6 0 234 9 1
## 254 387 3 0 368 16 1
## 255 178 9 0 170 6 0
## 257 270 3 0 256 10 3
## 258 351 4 1 344 6 0
## 294 616 523 21 459 125 21
## num_hahas num_sads num_angrys
## 239 2 0 0
## 248 0 0 0
## 249 0 0 0
## 252 0 0 0
## 254 0 0 0
## 255 0 0 0
## 257 0 0 0
## 258 0 0 0
## 294 8 0 1
l[l!=0]
## [1] 4 1 3 3 2 2 1 1 2
#We assume that the data that doesn't follow the assumption is incorrect, we check the total percent of it
# As it is less than 1%, we decide to delete it
(length(l[l!=0])/nrow(df) ) * 100
## [1] 0.1276596
# Removing the data that does not justify the assumption
df <- df[!count==FALSE,]
str(df)
## 'data.frame': 7041 obs. of 9 variables:
## $ num_reactions: int 529 150 227 111 213 217 503 295 203 170 ...
## $ num_comments : int 512 0 236 0 0 6 614 453 1 9 ...
## $ num_shares : int 262 0 57 0 0 0 72 53 0 1 ...
## $ num_likes : int 432 150 204 111 204 211 418 260 198 167 ...
## $ num_loves : int 92 0 21 0 9 5 70 32 5 3 ...
## $ num_wows : int 3 0 1 0 0 1 10 1 0 0 ...
## $ num_hahas : int 1 0 1 0 0 0 2 1 0 0 ...
## $ num_sads : int 1 0 0 0 0 0 0 0 0 0 ...
## $ num_angrys : int 0 0 0 0 0 0 3 1 0 0 ...
# Scaling the data for better analysis of clusters
df <- scale(df)
df<- as.data.frame(df)
head(df)
## num_reactions num_comments num_shares num_likes num_loves num_wows
## 1 0.646222253 0.32297476 1.6854261 0.482786321 1.98781086 0.19654943
## 2 -0.172663058 -0.25219846 -0.3042799 -0.144283441 -0.31800076 -0.14742021
## 3 -0.006293219 0.01292045 0.1285950 -0.024206253 0.20832581 -0.03276367
## 4 -0.256928301 -0.25219846 -0.3042799 -0.231005855 -0.31800076 -0.14742021
## 5 -0.036542280 -0.25219846 -0.3042799 -0.024206253 -0.09243223 -0.14742021
## 6 -0.027899691 -0.24545815 -0.3042799 -0.008640691 -0.19268491 -0.03276367
## num_hahas num_sads num_angrys
## 1 0.07681282 0.4730465 -0.1556598
## 2 -0.17579767 -0.1526759 -0.1556598
## 3 0.07681282 -0.1526759 -0.1556598
## 4 -0.17579767 -0.1526759 -0.1556598
## 5 -0.17579767 -0.1526759 -0.1556598
## 6 -0.17579767 -0.1526759 -0.1556598
# Making a vector of status type groups. It can be used in case of verifying cluster accuracy.
group <- data$status_type
for (i in 1:length(group))
{
if (group[i] == "video")
{
group[i] <- 1
}
else if (group[i] == "photo")
{
group[i] <- 2
}
else if (group[i] == "link")
{
group[i] <- 3
}
else if (group[i] == "status")
{
group[i] <- 4
}
}
group <- group[!count==FALSE]
summary(group)
## Length Class Mode
## 7041 character character
We can that the number of likes and number of reactions have a high correlation; Also, the number of the loves statues and number of shares; the number of loves and number of hahas; number of loves and number of wows; the number of loves and number of comments; the number of shares and number of comments have a significant correlation.
corrplot(cor(df))
cor(df)
## num_reactions num_comments num_shares num_likes num_loves
## num_reactions 1.00000000 0.1508182 0.2508625 0.99494086 0.3044552
## num_comments 0.15081817 1.0000000 0.6406324 0.10167228 0.5221832
## num_shares 0.25086252 0.6406324 1.0000000 0.17258510 0.8221813
## num_likes 0.99494086 0.1016723 0.1725851 1.00000000 0.2089152
## num_loves 0.30445520 0.5221832 0.8221813 0.20891524 1.0000000
## num_wows 0.26766210 0.1623931 0.4078910 0.20773640 0.5094976
## num_hahas 0.17584823 0.3250046 0.3999405 0.12066387 0.5082269
## num_sads 0.07522229 0.2364420 0.1999311 0.05222869 0.2082756
## num_angrys 0.12427280 0.2251306 0.3125406 0.08739908 0.3715780
## num_wows num_hahas num_sads num_angrys
## num_reactions 0.26766210 0.1758482 0.07522229 0.12427280
## num_comments 0.16239310 0.3250046 0.23644201 0.22513056
## num_shares 0.40789100 0.3999405 0.19993111 0.31254065
## num_likes 0.20773640 0.1206639 0.05222869 0.08739908
## num_loves 0.50949759 0.5082269 0.20827560 0.37157800
## num_wows 1.00000000 0.2873832 0.08660209 0.18280525
## num_hahas 0.28738322 1.0000000 0.14148086 0.21165205
## num_sads 0.08660209 0.1414809 1.00000000 0.14209099
## num_angrys 0.18280525 0.2116520 0.14209099 1.00000000
As we have 4 status types - video, photo, status and links, I will start with building a model with 4 clusters.
model <- kmeans(df,centers = 4, nstart = 20)
summary(model)
## Length Class Mode
## cluster 7041 -none- numeric
## centers 36 -none- numeric
## totss 1 -none- numeric
## withinss 4 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 4 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
Observing the status types in each clusters.
table(data[!count == FALSE,]$status_type, model$cluster)
##
## 1 2 3 4
## link 0 49 14 0
## photo 1 4045 212 23
## status 0 295 70 0
## video 33 1870 76 353
Plotting Reactions vs Comments in 4 clusters
ggplot(data = df, aes(num_reactions, num_comments, color = model$cluster)) + geom_point()
Plotting number of reactions vs number of shares in 4 clusters
ggplot(data = df, aes(num_reactions, num_shares, color = model$cluster)) + geom_point()
Plotting number of shares vs number of comments in 4 clusters
ggplot(data = df, aes(num_shares, num_comments, color = model$cluster)) + geom_point()
Plotting positive emotions (love, wow, haha and likes) vs negative emotions (sad and angry) in 4 cluster analysis
ggplot(data = df, aes(num_likes+num_loves+num_wows + num_hahas, num_sads+num_angrys, color = model$cluster)) + geom_point()
Clusplot to plot 2 components in 4 clusters. The first two components explain ~ 58 % of the variability
clusplot(df,model$cluster)
Using Elbow method, Silhouette method and Gap statistic method I try to calculate the optimal number of clusters. Basod on my intuition, I chose 4 earlier as there were 4 groups of status type and it will be good to check if it was the right choice.
# Elbow method
fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "Elbow method")
# Plotting elbow plot with variance explained
opt <- Optimal_Clusters_KMeans(df, max_clusters=10, plot_clusters = TRUE)
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method")
# Gap statistic method
fviz_nbclust(df, kmeans, nstart = 20, method = "gap_stat", nboot = 10)+ labs(subtitle = "Gap statistic method")
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
In K-means, we set number of cluster is 2. The silhouette width is 0.77 and the plot looks convincing.
# Investigating k =2
fviz_cluster(kmeans(df, 2, nstart = 20), data = df)
km.sil<-silhouette((kmeans(df,2))$cluster, dist(df))
fviz_silhouette(km.sil)
## cluster size ave.sil.width
## 1 1 6426 0.86
## 2 2 615 -0.15
In K-means, we set number of cluster is 3. The silhouette width is 0.75 which is a bit lower than 2 clusters.
# Investigating k =3
fviz_cluster(kmeans(df, 3, nstart = 20), data = df)
km.sil<-silhouette((kmeans(df,3))$cluster, dist(df))
fviz_silhouette(km.sil)
## cluster size ave.sil.width
## 1 1 273 -0.11
## 2 2 6377 0.80
## 3 3 391 0.49
In K-means, we set number of cluster is 4. The silhouette width is 0.76 which is higher than 3 clusters but lower than 2. It is in the middle and the plot looks convincing.
fviz_cluster(kmeans(df, 4, nstart = 20), data = df)
km.sil<-silhouette((kmeans(df,4))$cluster, dist(df))
fviz_silhouette(km.sil)
## cluster size ave.sil.width
## 1 1 34 0.06
## 2 2 376 -0.05
## 3 3 372 0.51
## 4 4 6259 0.83
Using PAM for additional analysis
# Investigating K = 2
c1<-eclust(df, "pam", k= 2)
fviz_silhouette(c1)
## cluster size ave.sil.width
## 1 1 6511 0.80
## 2 2 530 0.04
fviz_cluster(c1)
# Investigating k = 3
c2<-eclust(df, "pam", k= 3)
fviz_silhouette(c2)
## cluster size ave.sil.width
## 1 1 803 -0.17
## 2 2 5824 0.85
## 3 3 414 0.45
fviz_cluster(c2)
# Investigating k = 4
c3<-eclust(df, "pam", k=4)
fviz_silhouette(c3)
## cluster size ave.sil.width
## 1 1 786 -0.17
## 2 2 4645 0.80
## 3 3 1277 -0.15
## 4 4 333 0.51
fviz_cluster(c3)
In the paper we analyzed k-means algorithm on the dataset. We found out that k=2 and k=4 will provide good results. We analysed the results from k-means with PAM clustering. K-means provided better results so we stick to it. As we have 4 status type, (video, photos, statuses, and links) if we try to match the points to cluster prediction and their group vector that I created earlier, for k= 2 it will lead to a lot of errors as there is no group 3 or 4. I will stick with k = 4 and it is confirmed to be a good result with the analysis from k-means.
References:
The K-Means Clustering Algorithm Intuition Demonstrated In R, https://uc-r.github.io/kmeans_clustering#elbow
University of Warsaw, Unsupervised Learning Course by dr Jacek Lewkowicz