This project concerns clustering with unsupervised learning methods. The goal of this article is to cluster books based on their short descriptions, rather than the typical grouping by genre. The database was built entirely by me in SQLite Studio and consists of 100 titles from several genres, such as Thriller, Science Fiction and Contemporary Fiction, and therefore varies in its main themes. All the inputs were collected from the Goodreads platform (the most widely used online database of books, reviews and personal libraries). I will attempt to determine the optimal number of clusters, divide the database contents accordingly, and analyse the results.
First, let’s load all the libraries we will use in the study.
library("knitr")
library("topicmodels")
library("tidytext")
library("dendextend")
##
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library("fpc")
library("ggplot2")
library("pdp")
library("cluster")
library("stats")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("h2o")
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library("tm")
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library("stopwords")
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
library("SnowballC")
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("corpus")
library("radiant.data")
## Loading required package: magrittr
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:h2o':
##
## day, hour, month, week, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Loading required package: tidyr
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
##
## Attaching package: 'radiant.data'
## The following objects are masked from 'package:lubridate':
##
## month, wday
## The following object is masked from 'package:magrittr':
##
## set_attr
## The following object is masked from 'package:h2o':
##
## month
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following object is masked from 'package:base':
##
## date
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
In unsupervised learning, labels are not assigned to particular data inputs; the approach instead aims at understanding the structure of the unlabelled data (Zwitch, 2013). Such unlabelled patterns are then grouped using clustering methods. A typical clustering process involves representing the data, defining a measure of pattern proximity, and finally grouping (Shaheen et al., 2011). Depending on the features of the dataset in use, different clustering methods can be applied for optimal results. For text data, k-means clustering is commonly recommended: it partitions the inputs into k subgroups, each input being allocated to the cluster whose centre is closest (Zwitch, 2013). Within the k-means family we can distinguish standard k-means and spherical k-means; because the database in use consists of only 100 observations, standard k-means is more appropriate and will be used in the project. In addition to k-means, partitioning around medoids (PAM) (Ibrahim & Harbi, 2013) and Latent Dirichlet Allocation (LDA) (Kulshrestha, 2019), a well-established method for topic modeling of text data, will be applied.
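To make the "allocate each input to its nearest centre" idea concrete before working with the real descriptions, here is a minimal sketch of k-means on a tiny made-up term-frequency matrix; the matrix, the term names and k = 2 are purely illustrative and not taken from the project data.
# Illustrative only: a tiny made-up document-term matrix (4 documents, 3 terms)
toy_dtm <- matrix(c(5, 0, 1,
                    4, 1, 0,
                    0, 6, 2,
                    1, 5, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:4), c("love", "murder", "house")))
set.seed(123)                                        # k-means uses random starting centres
toy_km <- stats::kmeans(toy_dtm, centers = 2, nstart = 10)
toy_km$cluster                                       # cluster label assigned to each document
toy_km$centers                                       # the two cluster centres in term space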
In order for the data to be successfully used and analysed, it needs some light preprocessing. It will not be an overly extensive process, because the database was handcrafted: the data set needs no cleaning, only correct reading into the R environment. Initially the database consisted of 12 variables: Book ID, Title, whether it is an English original, Author, Number of Pages, Publication Year, ISBN number, link to the book's Goodreads profile, Book Description, Genre, whether it is part of a series, and the Book Cover. After formulating the topic, I decided to keep only the following variables: Title, Description and Genre. Thus the database consists of 3 variables and 100 observations. There are 8 different genres in the database:
* Historical Fiction
* Contemporary Fiction
* Thriller
* Horror
* Mystery/Crime
* Non-Fiction
* Romance
* Science Fiction
opts_knit$set(root.dir = "/Users/ola/Desktop/UW/Semester 1/Unsupervised Learning/Project 1")
books <- read.csv("databasebooks.csv", header = FALSE, col.names = c("Title", "Description", "Genre"))
head(books, 3)
## Title
## 1 The Muse
## 2 The Hating Game
## 3 What Lies Between Us
## Description
## 1 On a hot July day in 1967, Odelle Bastien climbs the stone steps of the Skelton gallery in London, knowing that her life is about to change forever. Having struggled to find her place in the city since she arrived from Trinidad five years ago, she has been offered a job as a typist under the tutelage of the glamorous and enigmatic Marjorie Quick. But though Quick takes Odelle into her confidence, and unlocks a potential she didn't know she had, she remains a mystery - no more so than when a lost masterpiece with a secret history is delivered to the gallery.\n\nThe truth about the painting lies in 1936 and a large house in rural Spain, where Olive Schloss, the daughter of a renowned art dealer, is harbouring ambitions of her own. Into this fragile paradise come artist and revolutionary Isaac Robles and his half-sister Teresa, who immediately insinuate themselves into the Schloss family, with explosive and devastating consequences
## 2 Lucy Hutton has always been certain that the nice girl can get the corner office. She’s charming and accommodating and prides herself on being loved by everyone at Bexley & Gamin. Everyone except for coldly efficient, impeccably attired, physically intimidating Joshua Templeman. And the feeling is mutual.\n\nTrapped in a shared office together 40 (OK, 50 or 60) hours a week, they’ve become entrenched in an addictive, ridiculous never-ending game of one-upmanship. There’s the Staring Game. The Mirror Game. The HR Game. Lucy can’t let Joshua beat her at anything—especially when a huge new promotion goes up for the taking.\n\nIf Lucy wins this game, she’ll be Joshua’s boss. If she loses, she’ll resign. So why is she suddenly having steamy dreams about Joshua, and dressing for work like she’s got a hot date? After a perfectly innocent elevator ride ends with an earth-shattering kiss, Lucy starts to wonder whether she’s got Joshua Templeman all wrong.\n\nMaybe Lucy Hutton doesn’t hate Joshua Templeman. And maybe, he doesn’t hate her either. Or maybe this is just another game.
## 3 They say every house has its secrets, and the house that Maggie and Nina have shared for so long is no different. Except that these secrets are not buried in the past.\n\nEvery other night, Maggie and Nina have dinner together. When they are finished, Nina helps Maggie back to her room in the attic, and into the heavy chain that keeps her there. Because Maggie has done things to Nina that can’t ever be forgiven, and now she is paying the price.\n\nBut there are many things about the past that Nina doesn’t know, and Maggie is going to keep it that way—even if it kills her.\n\nBecause in this house, the truth is more dangerous than lies.
## Genre
## 1 Historical Fiction
## 2 Romance
## 3 Thriller
After reading the data in, we can briefly check how the titles would be distributed had we used the conventional genre-based division, as is done in bookstores and libraries. As there are 8 genres in the data set, the inputs are partitioned accordingly. Because we are dealing with text data, we cannot compute means or medians for introductory descriptive statistics, as we would with numerical data.
table(books$Genre)
##
## Contemporary Fiction Historical Fiction Horror
## 16 12 12
## Mystery/Crime Non-Fiction Romance
## 10 7 12
## Science Fiction Thriller
## 19 12
As we can see, the data is spread fairly evenly, with roughly 12 titles per genre. Will the clusters be evenly divided too? Let's find out.
Now onto the data standardisation. First we convert each element of the Description variable to lower case with tolower(). The next step is to remove stopwords from the same variable with removeWords(), using the predefined stopword list returned by stopwords(). Then we use removePunctuation() from the same tm package, which is rather self-explanatory. Finally, we apply word stemming with stemDocument() to reduce words to their roots and enable topic modeling.
books$Description <- tolower(books$Description)
books$Description <- removeWords(books$Description, words = stopwords())
books$Description <- removePunctuation(books$Description, ucp = TRUE)
books$Description <- stemDocument(books$Description)
books_corpus = as_corpus_frame(books$Description)
The next significant step is text segmentation, i.e. partitioning the written text into meaningful units, which can be words, topics or sentences (Ponte & Croft, 1997). We will build a Document Term Matrix (DTM) to track the frequency of each term (Munoz, 2020). After the descriptions are split into terms, we will discard the words with the lowest frequency for more efficient clustering, since we end up with 2454 variables (distinct terms).
frequency_matrix <- term_matrix(books_corpus)
frequency_matrix = as.data.frame(as.matrix(frequency_matrix))
head(books_corpus, 3)
## text
## 1 hot juli day 1967 odell bastien climb stone step skelton galleri london know …
## 2 luci hutton alway certain nice girl can get corner offic s charm accommod pri…
## 3 say everi hous secret hous maggi nina share long differ except secret buri pa…
#Word elimination by lowest frequency
transposed = t(frequency_matrix)
transposed = as.data.frame(transposed)
transposed$counter = rowSums(transposed)
transposed = rownames_to_column(transposed, var = "rowname")
wordcount = as.data.frame(transposed$counter, transposed$rowname)
colnames(wordcount) <- c("count")
count(wordcount$count)
## x freq
## 1 1 1421
## 2 2 401
## 3 3 215
## 4 4 115
## 5 5 73
## 6 6 48
## 7 7 30
## 8 8 24
## 9 9 19
## 10 10 14
## 11 11 12
## 12 12 4
## 13 13 5
## 14 14 5
## 15 15 6
## 16 16 3
## 17 17 3
## 18 18 4
## 19 19 4
## 20 20 2
## 21 21 2
## 22 22 2
## 23 24 3
## 24 25 2
## 25 28 1
## 26 29 1
## 27 32 1
## 28 33 1
## 29 36 1
## 30 38 1
## 31 42 2
## 32 66 1
As the highest frequency equals 66 and the lowest 1, there is a wide gap between the extremes. Hence I decided to remove all the words that occur fewer than 4 times, keeping those that occur 4 times or more.
high_occurence = select(transposed, rowname, counter)
high_occurence %>% arrange(desc(counter)) %>% head(10)
## rowname counter
## 1 one 66
## 2 life 42
## 3 s 42
## 4 year 38
## 5 new 36
## 6 world 33
## 7 love 32
## 8 find 29
## 9 live 28
## 10 day 25
As the letter "s" ranks as the third most frequent "word" (an artefact of stripping the apostrophe from possessive forms), we will remove it from the dataset, since it gives no insight for topic modeling.
#Deleting the words that occur less than 4 times
highest_occurence = high_occurence[high_occurence$counter >= 4 & high_occurence$rowname != "s", ]
highest_occurence %>% arrange(counter) %>% head(10)
## rowname counter
## 1 alicia 4
## 2 america 4
## 3 american 4
## 4 around 4
## 5 author 4
## 6 bear 4
## 7 began 4
## 8 best 4
## 9 better 4
## 10 biggest 4
Let's see what the dataset looks like when we build the document-term matrix from the words whose counter is greater than 4 (a slightly stricter cut than the threshold applied above).
dtm_4_word = dplyr::filter(transposed, transposed$counter > 4) %>% select(-counter)
col_names_4 <- dtm_4_word$rowname
dtm_4_word <- dtm_4_word %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_4_word) <- col_names_4
head(dtm_4_word, 8)[1:8]
## across age ago aliv allow also alway anim
## V1 0 0 1 0 0 0 0 0
## V2 0 0 0 0 0 0 1 0
## V3 0 0 0 0 0 0 0 0
## V4 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0
## V6 0 0 1 0 0 1 0 0
## V7 0 0 0 0 0 0 0 0
## V8 0 0 0 0 0 0 0 0
We arrive at a dataframe of 100 observations and 274 variables. This is the starting point for the clustering process.
As mentioned above, we cannot compute the usual descriptive statistics on the initial dataset. However, after standardisation, text segmentation and frequency filtering, we can see how the final word occurrence is distributed.
summary(highest_occurence)
## rowname counter
## Length:388 Min. : 4.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 6.000
## Mean : 7.809
## 3rd Qu.: 9.000
## Max. :66.000
As we can see, the extremes are as shown above, and both the median and the mean lie much closer to the minimum than to the maximum. Since the mean is higher than the median, we are dealing with a right-skewed distribution.
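As a quick numerical confirmation of that right skew, one can compute a simple moment-based skewness coefficient in base R; this is a sketch only and assumes the highest_occurence data frame from the chunks above is still in memory.
x <- highest_occurence$counter
# third standardised moment: a clearly positive value indicates a right-skewed distribution
mean((x - mean(x))^3) / stats::sd(x)^3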
As the database was handmade, we do not need to scale the data or check for missing values. We can start by examining the clustering tendency of the data using the factoextra package. The function get_clust_tendency evaluates the clustering tendency via the Hopkins statistic and a visual assessment in the form of a dissimilarity image. Notably, the more dissimilarity there is, the better for the clustering process, as the clusters will be more sharply defined (Easy Guides, 2016). The Hopkins statistic should exceed 0.75 for the dataset to be considered clusterable; such a result indicates a clustering tendency at the 90% confidence level. The null and alternative hypotheses are as follows:
* Null hypothesis: the data set is uniformly distributed (no meaningful clusters)
* Alternative hypothesis: the data set is not uniformly distributed (contains meaningful clusters) (Data Novia, 2018).
To check the clusterability of our data, we need to transform the text into numerics. Therefore I added a new column, word_id, to be able to refer to the inputs later. For the get_clust_tendency function we will create a new data frame containing only the counter and word_id variables.
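For intuition, here is a rough, self-contained sketch of what the Hopkins statistic measures: it compares nearest-neighbour distances of artificial uniform points with those of points sampled from the data. The sampling scheme, the default m, and the convention that values near 1 indicate clusterable data are my assumptions about what get_clust_tendency does internally, so treat this only as an illustration; it can be tried on the clusterability data frame built in the next chunk.
hopkins_sketch <- function(X, m = 10, seed = 123) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  idx  <- sample(n, m)                                   # m real observations sampled from the data
  # m artificial points drawn uniformly within each column's range
  unif <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  # nearest-neighbour distance from each artificial point to the data
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # nearest-neighbour distance from each sampled point to the remaining data points
  w <- sapply(idx, function(i) min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  sum(u) / (sum(u) + sum(w))                             # values close to 1 suggest clusterable data
}
# e.g. hopkins_sketch(clusterability), once `clusterability` is built below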
highest_occurence$word_id <- c(1:nrow(highest_occurence))
clusterability <- select(highest_occurence, counter, word_id)
get_clust_tendency(clusterability, nrow(clusterability) - 1, graph = TRUE, gradient = list(low = "red", mid = "white", high = "blue"), seed = 123)
## $hopkins_stat
## [1] 0.8767744
##
## $plot
As we can see, the Hopkins statistic is well above 0.75, reaching 0.877, so our data appears highly clusterable. In the plot, the colour intensity is proportional to the dissimilarity between observations. Now, knowing we will apply k-means and PAM, we can build the silhouettes for both methods.
km_sil <- fviz_nbclust(dtm_4_word, kmeans, method = "silhouette") + ggtitle("K-Means")
pam_sil <- fviz_nbclust(dtm_4_word, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
grid.arrange(km_sil, pam_sil, ncol = 2, bottom = "Optimal Number of Clusters")
From the graphs we can see that for k-means the optimal number of clusters would be 3, giving the highest average silhouette width, whereas for PAM the optimal number is 2. The silhouette width ranges from -1 to 1: the higher the score, the better separated the clusters; a negative score indicates samples assigned to the wrong cluster, while a score close to 0 indicates overlapping clusters (Kumar, 2020). Because PAM's silhouette width at 2 clusters equals about 0.4, the highest of all, with quite good separation of clusters, we will proceed with 2 clusters.
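The silhouette values behind these plots can also be computed directly; the sketch below is my own illustration, with an arbitrary seed and nstart, and uses cluster::silhouette on a 2-cluster k-means fit of the filtered DTM.
set.seed(123)
km_check  <- kmeans(dtm_4_word, centers = 2, nstart = 25)
sil_check <- cluster::silhouette(km_check$cluster, dist(dtm_4_word))
mean(sil_check[, "sil_width"])                                  # overall average silhouette width
tapply(sil_check[, "sil_width"], sil_check[, "cluster"], mean)  # per-cluster averages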
Now that we have decided to proceed with 2 clusters, we can move on to the clustering process, implementing both K-Means and PAM. Because the K-Means silhouette suggested 3 clusters, I will also experiment with that scenario. We will therefore obtain plots for K-Means with 2 and 3 clusters, as well as PAM with 2 and 3 clusters.
kmclusters2 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
kmclusters3 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean", graph = FALSE)
pamclusters2 <- eclust(dtm_4_word, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
pamclusters3 <- eclust(dtm_4_word, FUNcluster = "pam", k = 3, hc_metric = "euclidean", graph = FALSE)
#Clustering with factoextra
cluster1 <- fviz_cluster(kmclusters2, geom = c("point")) + ggtitle("K-Means with 2 Clusters")
cluster2 <- fviz_cluster(kmclusters3, geom = c("point")) + ggtitle("K-Means with 3 Clusters")
cluster3 <- fviz_cluster(pamclusters2, geom = c("point")) + ggtitle("PAM with 2 Clusters")
cluster4 <- fviz_cluster(pamclusters3, geom = c("point")) + ggtitle("PAM with 3 Clusters")
grid.arrange(cluster1, cluster2, cluster3, cluster4, ncol = 2, bottom = "Clustering with K-Means and PAM")
## Follow-Up Analysis

Once the optimal number of clusters has been chosen, there is not much room for changes in the k-means analysis, so the results obtained from the fviz_nbclust function are the ones we have to work with. As we can see from the graphs, the results of the PAM algorithm are rather uninformative: we obtain one major cluster with a few insignificant residues in the remaining clusters. With the K-Means algorithm the situation looks somewhat different. Although we are dealing with overlapping clusters, which was inevitable given the low average silhouette widths (approximately 0.16 and 0.18 for the two scenarios), we have multiple clearly defined clusters, even if in the three-cluster case one of them is rather insignificant. That was to be expected, as the difference in average silhouette width between 2 and 3 clusters was small, and in the 2-cluster case we can clearly see that the second cluster "pulled in" the third cluster from the 3-cluster case. Now we can examine the average widths of all clusters for the k-means settings.
km1 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean")
km2 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean")
sill_km1 <- fviz_silhouette(km1) + ggtitle("K-Means with 2 clusters")
## cluster size ave.sil.width
## 1 1 82 0.22
## 2 2 18 -0.11
sill_km2 <- fviz_silhouette(km2) + ggtitle("K-Means with 3 clusters")
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 20 -0.13
## 3 3 79 0.24
km1$silinfo$clus.avg.widths
## [1] 0.2158704 -0.1110941
km2$silinfo$clus.avg.widths
## [1] 0.0000000 -0.1259044 0.2352798
grid.arrange(sill_km1, sill_km2, ncol = 2 , bottom = "K-Means Cluster Widths")
As we can see from the eclust output, although the clustering was carried out and the observations were divided, the resulting partition is rather poor. In the 2-cluster K-Means case, the second cluster contains fewer than 10 observations that do not overlap with cluster 1, while the rest belong to the overlapping region or to cluster 1 only; a substantial majority, 82 of the 100 observations, is allocated to cluster 1. Thus we do not really have a successful clustering example; what we obtain is closer to a set of labelled outliers, which is not the anticipated result. In the 3-cluster K-Means case the situation is marginally better, as more observations belong fully to cluster 2. Here, however, we also get a single-observation cluster: cluster 1 consists of the input V2 alone, which is so distant from all the other observations that it forces an additional cluster of its own. Cluster 3 is very similar to cluster 1 from the previous K-Means run, holding the majority of the values, here 79 of them, some shared with cluster 2. Both silhouette width plots for K-Means contain negative values, a consequence of the significant overlap between clusters. In an ideal clustering situation we would want well-defined, well-separated clusters, so let us check the PAM widths, in case a more suitable algorithm application was missed.
km3 <- eclust(dtm_4_word, FUNcluster = "pam", k = 2, hc_metric = "euclidean")
km4 <- eclust(dtm_4_word, FUNcluster = "pam", k = 3, hc_metric = "euclidean")
sill_km3 <- fviz_silhouette(km3) + ggtitle("PAM with 2 clusters")
## cluster size ave.sil.width
## 1 1 99 0.41
## 2 2 1 0.00
sill_km4 <- fviz_silhouette(km4) + ggtitle("PAM with 3 clusters")
## cluster size ave.sil.width
## 1 1 98 0.32
## 2 2 1 0.00
## 3 3 1 0.00
km3$silinfo$clus.avg.widths
## [1] 0.4089669 0.0000000
km4$silinfo$clus.avg.widths
## [1] 0.3189033 0.0000000 0.0000000
grid.arrange(sill_km3, sill_km4, ncol = 1, bottom = "PAM Cluster Widths")
Here we also face a conundrum, though of a different kind than in the K-Means clustering. On the one hand, we are not dealing with negative cluster widths. On the other hand, the main cluster is so dominant that the remaining clusters might as well not exist, in both PAM scenarios. Such output might indicate poor clusterability, especially in the case of the PAM algorithm. To examine clusterability further, let's conduct a Duda-Hart test to see whether the data should be split into 2 clusters (one of the scenarios).
book_dist <- dist(dtm_4_word)
hclust_2 <- cutree(hclust(book_dist), 2)
dudahart2(dtm_4_word, hclust_2)
## $p.value
## [1] 8.739347e-07
##
## $dh
## [1] 0.9568937
##
## $compare
## [1] 0.971314
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
For the 2-cluster split we obtained FALSE for cluster1, with dh lower than compare, and a p-value of 8.74e-07, far below the significance level (alpha = 0.001). We therefore reject the null hypothesis of a single homogeneous cluster: formally, the test supports splitting the data into 2 clusters, in line with the Hopkins statistic. This leaves us with a dilemma. Although the initial database consisted of 8 different genres and both formal tests point to some structure, the partitions actually produced from the description variable were dominated by one large cluster and were of little practical use. The common saying that "less is more" does not apply here. After preprocessing, we should conduct at least two checks of clusterability and also inspect the resulting partitions themselves; one validation is not enough, and more evidence is necessary for an optimal and objective answer.
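As one further, purely visual check (a sketch of my own, using the already loaded dendextend package and the same Euclidean distances computed above; the Ward linkage is an arbitrary choice), we can look at the hierarchical clustering dendrogram and see whether any natural split emerges. If the dendrogram also fails to show two well-separated branches, that corroborates the weak partitions seen above.
hc_ward <- hclust(book_dist, method = "ward.D2")   # book_dist was computed for the Duda-Hart test
dend <- color_branches(as.dendrogram(hc_ward), k = 2)
plot(dend, main = "Ward hierarchical clustering of the book descriptions")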
Since the clustering with both K-Means and PAM algorithms turned out to be weak, despite the formal indications of clusterability, there is no point in examining the clusters individually.
Hence we will move on to LDA (Latent Dirichlet Allocation), to examine whether this method gives more sensible results. First we need to set the number of topics, which will be the number of genres, 8. LDA returns, for each input, the probabilities of allocation to each of the given topics (Kulshrestha, 2019).
#Number of topics
k <- 8
#LDA by genre number
lda1 <- LDA(dtm_4_word, k = k)
bookstopics <- as.data.frame(tidy(lda1, matrix = "beta"))
head(bookstopics, 10)
## topic term beta
## 1 1 across 0.0009025847
## 2 2 across 0.0005633805
## 3 3 across 0.0027279429
## 4 4 across 0.0005157646
## 5 5 across 0.0018480227
## 6 6 across 0.0011686842
## 7 7 across 0.0055191436
## 8 8 across 0.0020916216
## 9 1 age 0.0030153885
## 10 2 age 0.0002469744
#LDA by proposed cluster number
#2 clusters:
lda2 <- LDA(dtm_4_word, k = 2)
topics2 <- as.data.frame(tidy(lda2, matrix = "beta"))
head(topics2, 10)
## topic term beta
## 1 1 across 0.0019469616
## 2 2 across 0.0018817092
## 3 1 age 0.0054276052
## 4 2 age 0.0007114735
## 5 1 ago 0.0018673066
## 6 2 ago 0.0034879303
## 7 1 aliv 0.0029245319
## 8 2 aliv 0.0009097213
## 9 1 allow 0.0013682999
## 10 2 allow 0.0032205769
#3 clusters:
lda3 <- LDA(dtm_4_word, k = 3)
topics3 <- as.data.frame(tidy(lda3, matrix = "beta"))
head(topics3, 10)
## topic term beta
## 1 1 across 0.0015336294
## 2 2 across 0.0010727041
## 3 3 across 0.0031445971
## 4 1 age 0.0037469512
## 5 2 age 0.0004132162
## 6 3 age 0.0050500203
## 7 1 ago 0.0021376492
## 8 2 ago 0.0040160854
## 9 3 ago 0.0018754947
## 10 1 aliv 0.0034714877
The tidy function turned our model into a one-topic-per-term-per-row format. For every combination, the model gives the probability of that term being generated from that topic (Silge & Robinson, 2020). For example, for the word "age" in the output above, the probability of it being generated from topic 1 is about 0.0030, but only about 0.00025 for topic 2 (LDA is stochastic, so these values change between runs unless a seed is set). Now let's find the 10 terms that are most common within each topic for every LDA fit.
bookstopics_top <- bookstopics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph1 <- bookstopics_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph1
topics2_top <- topics2 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph2 <- topics2_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph2
topics3_top <- topics3 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph3 <- topics3_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph3
This visualisation lets us see the topics extracted from the books' descriptions and the importance of particular words in each topic. In the 8-topic LDA we mostly see different themes within rather similar topics. In contrast, the 3-topic LDA shows a more differentiated division. In the first topic, the prevailing words are "life", "new", "find", "perfect", "love" and "friend", which suggests lighter literature, possibly romance or contemporary fiction. In the second topic we also see words of a similar tone, such as "life", "year", "girl" and "love", but here we additionally find "murder" and "house". Although on its own the word "house" does not imply grim themes, paired with "murder" it reads as such; here we are probably dealing with mysteries, crime, thrillers and horror. In the last group we are missing any words with an emotional tone, such as "love", "perfect" or "friend", which suggests that non-fiction pieces were collected in this topic, possibly with some historical fiction, which is often less emotionally explicit in its descriptions. Clearly, the 3-topic LDA is the one with well-defined topics and clear inter-topic variation, so the further modeling and estimation will be conducted on that example.
In the foregoing LDA implementation we estimated each topic as a mixture of words. Notably, Latent Dirichlet Allocation can also model each input (document) as a mixture of topics, via the per-document-per-topic probability "gamma".
#3 clusters proposal:
lda3 <- LDA(dtm_4_word, k = 3)
gamma_topics3 <- as.data.frame(tidy(lda3, matrix = "gamma"))
head(gamma_topics3, 10)
## document topic gamma
## 1 V1 1 0.3215404
## 2 V2 1 0.2988681
## 3 V3 1 0.3293738
## 4 V4 1 0.3348483
## 5 V5 1 0.3475081
## 6 V6 1 0.3366960
## 7 V7 1 0.3325619
## 8 V8 1 0.3259601
## 9 V9 1 0.3233657
## 10 V10 1 0.3321911
As we can see, the probabilities of the three topics within each book description are nearly identical, all close to 1/3. This shows that although particular words characterise the topics, each description is modelled as a fairly even blend of all of them, so individual books are not strongly associated with a single topic.
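To quantify how blended the mixtures are, a short sketch (assuming gamma_topics3 from the chunk above and dplyr >= 1.0 for slice_max) can keep only each book's highest-probability topic and summarise it.
top_topic <- gamma_topics3 %>%
  dplyr::group_by(document) %>%
  dplyr::slice_max(gamma, n = 1) %>%
  dplyr::ungroup()
summary(top_topic$gamma)   # maxima barely above 1/3 confirm how evenly the topics are mixed
table(top_topic$topic)     # how many books each topic would claim if we forced an assignment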
In this project we have attempted to cluster text data. Although clustering is most commonly applied to numerical data, or to data transformed accordingly, I decided to follow the less common path. When analysing a text-based dataset, preprocessing is of the utmost importance, hence the text standardisation and segmentation needed to make the data ready for cluster allocation. However, regardless of the diligent preparation of the dataset and the necessary modifications, the results from K-Means and PAM were not of high quality: the partitions were dominated by a single large cluster. We therefore turned to LDA modeling, which gave a clearer topic division and performed better than the standard clustering measures, although the per-document-per-topic probabilities turned out to be nearly identical in most cases. The most important remark is that even when particular words define the topics noticeably, the heavy mixing within documents weakens conclusions about individual inputs. It is clear, having applied multiple measures, that to obtain the best possible results several separate methods have to be used, together with repeated verification of the clustering tendency and of the partitions themselves. Our dataset turned out to be a rather poor example for the clustering process. Nonetheless, it is an instructive case of poor practical clusterability, emphasising the importance of a well-suited dataset for future studies, and an interesting precedent showing that even when clustering does not work, there are alternatives such as LDA.
Data Novia. (2018). Assessing Clustering Tendency. Retrieved from https://www.datanovia.com/en/lessons/assessing-clustering-tendency/
Easy Guides. (2016). Assessing clustering tendency: A vital issue – Unsupervised Machine Learning. Retrieved from https://www.r-bloggers.com/2016/10/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning/
Ibrahim, L. F., & Harbi, M. H. A. (2013). Using modified partitioning around medoids clustering technique in mobile network planning. Retrieved from https://arxiv.org/pdf/1302.6602.pdf
Kulshrestha, R. (2019). A Beginner’s Guide to Latent Dirichlet Allocation (LDA). Retrieved from https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
Kumar, A. (2020). KMeans Silhouette Score Explained With Python Example. Retrieved from https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam
Munoz, E. (2020). Getting started with NLP: Tokenization, Document-Term Matrix, TF-IDF. Retrieved from https://medium.com/analytics-vidhya/getting-started-with-nlp-tokenization-document-term-matrix-tf-idf-2ea7d01f1942
Ponte, J. M., & Croft, W. B. (1997). Text segmentation by topic. In International Conference on Theory and Practice of Digital Libraries, 113-125. Springer, Berlin, Heidelberg. Retrieved from https://link.springer.com/chapter/10.1007/BFb0026725
Shaheen, M., Shahbaz, M., Guergachi, A., & Rehman, Z. (2011). Mining sustainability indicators to classify hydrocarbon development. Knowledge-Based Systems, 24(8), 1159-1168. Retrieved from https://www.sciencedirect.com/science/article/abs/pii/S0950705111000827
Silge, J., & Robinson, D. (2020). “Text Mining with R: A Tidy Approach”. 6 Topic modeling. Retrieved from https://www.tidytextmining.com/topicmodeling.html
Zwitch, R. (2013). Clustering Search Keywords Using K-Means Clustering. Retrieved from https://www.r-bloggers.com/2013/09/clustering-search-keywords-using-k-means-clustering/