In this paper several clustering methods are applied to a dataset containing average ratings of different types of places from Google reviews. The data contain 5456 instances with 25 attributes and were downloaded from the UCI Machine Learning Repository.
The packages used in this paper, together with the code that installs and loads them, are listed below.
requiredPackages = c("readr", "cluster", "factoextra", "flexclust", "fpc", "clustertend", "ClusterR", "ggplot2", "plotly") # list of packages; add new packages here and rerun the code
for(i in requiredPackages) {if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages) {library(i,character.only = TRUE)}
Once the packages are loaded, we can start forming clusters. Because the data show the average ratings given by Google review users, clustering can reveal patterns in human behaviour. As a first example, consider a simple clustering of the average ratings of burger/pizza shops (Category 13) against the average ratings of pubs/bars (Category 11).
km1 <- kmeans(set_01f, 4)
fviz_cluster(list(data=set_01, cluster=km1$cluster), ellipse.type="norm", geom="point",
stand=FALSE, palette="jco", ggtheme=theme_classic()) #factoextra::
To a certain degree we can see that people who on average dislike bars and pubs also tend to dislike pizza and burger shops. Grouping users in this manner can be beneficial, for example for targeting advertisements.
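For readers who want to try the k-means step without the review data, the following is a minimal sketch on synthetic two-column data. The data frame `toy`, its column names and the seed are assumptions made for illustration and are not part of the original analysis. The `nstart` argument is added because k-means starts from random centers and can otherwise return different partitions between runs.

```r
# Hypothetical stand-in for two rating columns (values on a 0-5 scale).
set.seed(42)
toy <- data.frame(
  pubs_bars    = c(rnorm(50, 1.0, 0.2), rnorm(50, 4.0, 0.2)),
  burger_pizza = c(rnorm(50, 1.2, 0.2), rnorm(50, 3.8, 0.2))
)

# k-means with several random restarts; the best solution is kept.
km_toy <- kmeans(toy, centers = 2, nstart = 25)

km_toy$centers         # one row of coordinates per cluster
table(km_toy$cluster)  # number of points per cluster
```

Setting a seed before the call makes the cluster labels reproducible, which matters when the cluster numbers are later discussed in the text.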
sil<-silhouette(km1$cluster, dist(set_01f))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 1077          0.58
## 2       2 2058          0.64
## 3       3  913          0.62
## 4       4 1408          0.85
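The silhouette plot adds a measure of how well each cluster is formed. According to the summary above, cluster 4 has the highest average silhouette width (0.85), so its points are the most tightly grouped and best separated from the other clusters, while cluster 1 has the lowest (0.58) and is the least well separated from its neighbours.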
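As a rough illustration of what the average silhouette width measures, the sketch below computes it per cluster on synthetic data. The matrix `toy` and all parameter values are assumptions chosen for the example; only `silhouette()` from the cluster package, already used above, is relied on.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Hypothetical data: one tight group and one loose group.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 1.0), ncol = 2)
)
km_toy <- kmeans(toy, centers = 2, nstart = 25)

# s(i) = (b - a) / max(a, b): a = mean distance to own cluster,
# b = mean distance to the nearest other cluster.
sil_toy <- silhouette(km_toy$cluster, dist(toy))

# Average silhouette width per cluster; values near 1 mean compact,
# well-separated clusters, values near 0 mean overlapping ones.
avg_width <- tapply(sil_toy[, "sil_width"], sil_toy[, "cluster"], mean)
avg_width
```

In this toy setting the tighter group should receive the higher average width, which is the same reading applied to the table above.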
Partitioning around medoids (PAM) is another clustering method presented in this paper. It is said to be more robust to outliers than the k-means method shown previously. In this example we cluster the average ratings of art galleries (Category 16) and the average ratings of museums (Category 7).
pam1 <- pam(set_2, 3)
fviz_cluster(pam1, geom = "point", ellipse.type = "norm")
Here we can see that no large group with high ratings in both categories exists. Estimating the chance that an individual likes one category based on their ratings of the other may prove difficult with this set of data. Yet that information is valuable too: if we were to find the best facility to recommend to a lover of one of those categories, we would be prompted to look in other categories.
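The robustness claim can be illustrated with a small sketch. It assumes synthetic data (the matrix `toy` and the outlier at (50, 50) are invented for the example) and compares the mean of the points, which is what a single k-means center would be, with the medoid returned by `pam()`.

```r
library(cluster)  # provides pam()

set.seed(7)
# Hypothetical ratings: 30 users near (2, 2) plus one extreme outlier.
toy <- rbind(
  matrix(rnorm(60, mean = 2, sd = 0.2), ncol = 2),
  c(50, 50)
)

mean_center <- colMeans(toy)    # what a k-means center with k = 1 would be
pam_toy     <- pam(toy, k = 1)  # medoid: an actual observation
medoid      <- pam_toy$medoids[1, ]

mean_center  # pulled towards the outlier
medoid       # stays inside the main group
```

Because a medoid must be one of the observations and minimises the sum of distances rather than squared distances, a single extreme point shifts it far less than it shifts a mean.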
Another method that is often used in data mining is CLARA (CLustering LArge Applications), which, as the name suggests, is designed for large data sets. In this example we use a set that contains the average ratings of theaters (Category 6) and pubs/bars (Category 11).
cl2<-eclust(set_4, "clara", k=4) # factoextra
The clusters we obtained suggest that lovers of both kinds of places form the smallest group, and that Google review users who enjoy their time at theaters rate pubs and bars only moderately.
fviz_silhouette(cl2)
##   cluster size ave.sil.width
## 1       1 2228          0.02
## 2       2 1379          0.42
## 3       3 1147          0.22
## 4       4  702          0.33
In this case the silhouettes of three of the clusters are fairly similar: each has roughly the same width, which indicates a similar dispersion of points. The fourth, red cluster is characterised by a much larger dispersion of points. In the summary above, the one cluster that stands out is cluster 1, with an average silhouette width of only 0.02.
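To show why CLARA scales to large data, the following sketch calls `clara()` from the cluster package directly on synthetic data. The matrix `big` and the values of `samples` and `sampsize` are assumptions for illustration; the analysis above goes through `eclust()` instead.

```r
library(cluster)  # provides clara()

set.seed(123)
# Hypothetical large data set: 10,000 points in two groups.
big <- rbind(
  matrix(rnorm(10000, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(10000, mean = 5, sd = 0.5), ncol = 2)
)

# CLARA runs PAM on `samples` random subsets of `sampsize` rows each
# and keeps the medoids that fit the full data best.
cl_toy <- clara(big, k = 2, samples = 20, sampsize = 200)

cl_toy$medoids             # one representative observation per cluster
table(cl_toy$clustering)   # cluster sizes over all 10,000 rows
```

Only the subsamples are passed through the expensive PAM step, so the cost grows slowly with the number of rows, at the price of medoids that depend on which rows were sampled.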
In some cases, such as this one, we may want to see clusters built from more than two categories, and R provides quick and easy ways of doing so. First, to determine the number of clusters, we will use a cluster dendrogram, as visual inspection of the raw data may prove difficult.
set_3 <- google_review_ratings[,c(5,6,11)]
hc = hclust(dist(set_3), method = "ward.D")
hc
##
## Call:
## hclust(d = dist(set_3), method = "ward.D")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 5456
plot(hc, hang = -5)
With the dendrogram plotted, the number of clusters is open to debate, but 3 seems like a reasonable choice.
cluster_set_3 <- cutree(hc, k=3)
set_3$cluster_set_3 <-cluster_set_3
var1 <- set_3$Category.4
var2 <- set_3$Category.5
var3 <- set_3$Category.10
clusters3D <- as.factor(set_3$cluster_set_3)
Here the three dimensions are the average ratings of parks (Category 5) on the x axis, theaters (Category 6) on the y axis and pubs/bars (Category 11) on the z axis, matching columns 5, 6 and 11 selected above.
plot_ly(x=var1, y=var2, z=var3, type="scatter3d", mode="markers", color=clusters3D) # factor gives one discrete colour per cluster
With a single line of code and the plot_ly function we can create an interactive 3D plot in which the clusters are clearly visible thanks to the coloring. Analysis of such clusters is not as easy as analysis of two-dimensional ones, but in some cases it may prove useful. Here, for example, we can see a cluster of people, marked in green, that consists of users who gave low ratings to facilities in all three categories. With that knowledge we can reasonably assume that a user who disliked two out of three of those facilities will also dislike the third one, which for many business purposes is an important piece of knowledge.
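The hierarchical step used above (Ward linkage followed by a cut of the tree) can be tried out on synthetic data. In the sketch below the matrix `toy` and its three well-separated groups are assumptions made for illustration; `hclust()` and `cutree()` are the same base R functions used in the analysis.

```r
set.seed(99)
# Hypothetical three-column data with three groups of 40 points each.
toy <- rbind(
  matrix(rnorm(120, mean = 0, sd = 0.3), ncol = 3),
  matrix(rnorm(120, mean = 3, sd = 0.3), ncol = 3),
  matrix(rnorm(120, mean = 6, sd = 0.3), ncol = 3)
)

# Agglomerative clustering on Euclidean distances with Ward linkage.
hc_toy <- hclust(dist(toy), method = "ward.D")

# Cut the tree into k groups; each row receives a label from 1 to k.
labels_toy <- cutree(hc_toy, k = 3)
table(labels_toy)

# Heights of the last merges: a large jump suggests where to cut.
tail(hc_toy$height, 4)
```

The method `"ward.D"` is kept to match the call above. According to the `hclust` documentation, `"ward.D2"` is the variant that implements Ward's criterion on unsquared Euclidean distances, so it may be worth comparing both.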
Clustering offers a quick way to extract a lot of information from datasets, and with packages such as those used here it becomes a process that is quick and easy to learn. As is often the case, the hardest part is a useful interpretation of the results. Still, trying different clustering methods is fast and feasible even for large datasets, and the available packages make it possible to present the findings clearly, either as simple 2D graphics or as interactive 3D plots.