This project concerns clustering with unsupervised learning methods. The goal of this article is to cluster books based on their short descriptions, rather than the typical grouping by genre. The database was built entirely by me in SQLite Studio and consists of 100 titles from several genres, such as Thriller, Science Fiction and Contemporary Fiction, and therefore varies in its main themes. All the inputs were collected from the Goodreads platform (the most widely used online database of books, reviews and personal libraries). I will attempt to determine the optimal number of clusters, divide the database contents accordingly, and analyse the results.
First, let’s load all the libraries we will use in the study.
library("knitr")
library("topicmodels")
library("tidytext")
library("dendextend")
##
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library("fpc")
library("ggplot2")
library("pdp")
library("cluster")
library("stats")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("h2o")
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library("tm")
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library("stopwords")
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
library("SnowballC")
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library("corpus")
library("radiant.data")
## Loading required package: magrittr
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:h2o':
##
## day, hour, month, week, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Loading required package: tidyr
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
##
## Attaching package: 'radiant.data'
## The following objects are masked from 'package:lubridate':
##
## month, wday
## The following object is masked from 'package:magrittr':
##
## set_attr
## The following object is masked from 'package:h2o':
##
## month
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following object is masked from 'package:base':
##
## date
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
In unsupervised learning, labels are not assigned to particular data inputs; the approach instead aims at understanding the structure of the unlabelled data (Zwitch, 2013). Such unlabelled patterns are then grouped using clustering methods. A typical clustering process involves representing the data, defining a measure of pattern proximity, and finally grouping (Shaheen et al., 2011). Depending on the features of the dataset in use, different clustering methods can be applied for optimal results. For text data, k-means clustering is commonly recommended: it partitions the inputs into k subgroups, each input being allocated to the cluster whose centre is closest (Zwitch, 2013). Within the k-means family we can distinguish standard k-means and spherical k-means; because the database in use consists of only 100 observations, standard k-means is more appropriate and will be used in the project. In addition to k-means, partitioning around medoids (PAM) (Ibrahim & Harbi, 2013) and Latent Dirichlet Allocation (LDA) (Kulshrestha, 2019), a well-established method for topic modeling of text data, will be applied.
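To make the "allocate each input to its nearest centre" idea concrete before working with the real descriptions, here is a minimal sketch of k-means on a tiny made-up term-frequency matrix; the matrix, the term names and k = 2 are purely illustrative and not taken from the project data.
# Illustrative only: a tiny made-up document-term matrix (4 documents, 3 terms)
toy_dtm <- matrix(c(5, 0, 1,
                    4, 1, 0,
                    0, 6, 2,
                    1, 5, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:4), c("love", "murder", "house")))
set.seed(123)                                        # k-means uses random starting centres
toy_km <- stats::kmeans(toy_dtm, centers = 2, nstart = 10)
toy_km$cluster                                       # cluster label assigned to each document
toy_km$centers                                       # the two cluster centres in term space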
In order for the data to be successfully used and analysed, it needs some light preprocessing. It will not be an overly extensive process, because the database was handcrafted: the data set needs no cleaning, only correct reading into the R environment. Initially the database consisted of 12 variables: Book ID, Title, whether it is an English original, Author, Number of Pages, Publication Year, ISBN number, link to the book's Goodreads profile, Book Description, Genre, whether it is part of a series, and the Book Cover. After formulating the topic, I decided to keep only the following variables: Title, Description and Genre. Thus the database consists of 3 variables and 100 observations. There are 8 different genres in the database:
* Historical Fiction
* Contemporary Fiction
* Thriller
* Horror
* Mystery/Crime
* Non-Fiction
* Romance
* Science Fiction
opts_knit$set(root.dir = "/Users/ola/Desktop/UW/Semester 1/Unsupervised Learning/Project 1")
books <- read.csv("databasebooks.csv", header = FALSE, col.names = c("Title", "Description", "Genre"))
head(books, 3)
## Title
## 1 The Muse
## 2 The Hating Game
## 3 What Lies Between Us
## Description
## 1 On a hot July day in 1967, Odelle Bastien climbs the stone steps of the Skelton gallery in London, knowing that her life is about to change forever. Having struggled to find her place in the city since she arrived from Trinidad five years ago, she has been offered a job as a typist under the tutelage of the glamorous and enigmatic Marjorie Quick. But though Quick takes Odelle into her confidence, and unlocks a potential she didn't know she had, she remains a mystery - no more so than when a lost masterpiece with a secret history is delivered to the gallery.\n\nThe truth about the painting lies in 1936 and a large house in rural Spain, where Olive Schloss, the daughter of a renowned art dealer, is harbouring ambitions of her own. Into this fragile paradise come artist and revolutionary Isaac Robles and his half-sister Teresa, who immediately insinuate themselves into the Schloss family, with explosive and devastating consequences
## 2 Lucy Hutton has always been certain that the nice girl can get the corner office. She’s charming and accommodating and prides herself on being loved by everyone at Bexley & Gamin. Everyone except for coldly efficient, impeccably attired, physically intimidating Joshua Templeman. And the feeling is mutual.\n\nTrapped in a shared office together 40 (OK, 50 or 60) hours a week, they’ve become entrenched in an addictive, ridiculous never-ending game of one-upmanship. There’s the Staring Game. The Mirror Game. The HR Game. Lucy can’t let Joshua beat her at anything—especially when a huge new promotion goes up for the taking.\n\nIf Lucy wins this game, she’ll be Joshua’s boss. If she loses, she’ll resign. So why is she suddenly having steamy dreams about Joshua, and dressing for work like she’s got a hot date? After a perfectly innocent elevator ride ends with an earth-shattering kiss, Lucy starts to wonder whether she’s got Joshua Templeman all wrong.\n\nMaybe Lucy Hutton doesn’t hate Joshua Templeman. And maybe, he doesn’t hate her either. Or maybe this is just another game.
## 3 They say every house has its secrets, and the house that Maggie and Nina have shared for so long is no different. Except that these secrets are not buried in the past.\n\nEvery other night, Maggie and Nina have dinner together. When they are finished, Nina helps Maggie back to her room in the attic, and into the heavy chain that keeps her there. Because Maggie has done things to Nina that can’t ever be forgiven, and now she is paying the price.\n\nBut there are many things about the past that Nina doesn’t know, and Maggie is going to keep it that way—even if it kills her.\n\nBecause in this house, the truth is more dangerous than lies.
## Genre
## 1 Historical Fiction
## 2 Romance
## 3 Thriller
After reading the data in, we can briefly check how the titles would be distributed had we used the conventional genre-based division, as is done in bookstores and libraries. As there are 8 genres in the data set, the inputs are partitioned accordingly. Because we are dealing with text data, we cannot compute means or medians for introductory descriptive statistics, as we would with numerical data.
table(books$Genre)
##
## Contemporary Fiction Historical Fiction Horror
## 16 12 12
## Mystery/Crime Non-Fiction Romance
## 10 7 12
## Science Fiction Thriller
## 19 12
As we can see, the data is spread fairly evenly, with roughly 12 titles per genre. Will the clusters be evenly divided too? Let's find out.
Now onto the data standardisation. First we convert each element of the Description variable to lower case with tolower(). The next step is to remove stopwords from the same variable with removeWords(), using the predefined stopword list returned by stopwords(). Then we use removePunctuation() from the same tm package, which is rather self-explanatory. Finally, we apply word stemming with stemDocument() to reduce words to their roots and enable topic modeling.
books$Description <- tolower(books$Description)
books$Description <- removeWords(books$Description, words = stopwords())
books$Description <- removePunctuation(books$Description, ucp = TRUE)
books$Description <- stemDocument(books$Description)
books_corpus = as_corpus_frame(books$Description)
The next significant step is text segmentation, i.e. partitioning the written text into meaningful units, which can be words, topics or sentences (Ponte & Croft, 1997). We will build a Document Term Matrix (DTM) to track the frequency of each term (Munoz, 2020). After the descriptions are split into terms, we will discard the words with the lowest frequency for more efficient clustering, since we end up with 2454 variables (distinct terms).
frequency_matrix <- term_matrix(books_corpus)
frequency_matrix = as.data.frame(as.matrix(frequency_matrix))
head(books_corpus, 3)
## text
## 1 hot juli day 1967 odell bastien climb stone step skelton galleri london know …
## 2 luci hutton alway certain nice girl can get corner offic s charm accommod pri…
## 3 say everi hous secret hous maggi nina share long differ except secret buri pa…
#Word elimination by lowest frequency
transposed = t(frequency_matrix)
transposed = as.data.frame(transposed)
transposed$counter = rowSums(transposed)
transposed = rownames_to_column(transposed, var = "rowname")
wordcount = as.data.frame(transposed$counter, transposed$rowname)
colnames(wordcount) <- c("count")
count(wordcount$count)
## x freq
## 1 1 1421
## 2 2 401
## 3 3 215
## 4 4 115
## 5 5 73
## 6 6 48
## 7 7 30
## 8 8 24
## 9 9 19
## 10 10 14
## 11 11 12
## 12 12 4
## 13 13 5
## 14 14 5
## 15 15 6
## 16 16 3
## 17 17 3
## 18 18 4
## 19 19 4
## 20 20 2
## 21 21 2
## 22 22 2
## 23 24 3
## 24 25 2
## 25 28 1
## 26 29 1
## 27 32 1
## 28 33 1
## 29 36 1
## 30 38 1
## 31 42 2
## 32 66 1
As the highest frequency equals 66 and the lowest 1, there is a wide gap between the extremes. Hence I decided to remove all the words that occur fewer than 4 times, keeping those that occur 4 times or more.
high_occurence = select(transposed, rowname, counter)
high_occurence %>% arrange(desc(counter)) %>% head(10)
## rowname counter
## 1 one 66
## 2 life 42
## 3 s 42
## 4 year 38
## 5 new 36
## 6 world 33
## 7 love 32
## 8 find 29
## 9 live 28
## 10 day 25
As the letter "s" ranks as the third most frequent "word" (an artefact of stripping the apostrophe from possessive forms), we will remove it from the dataset, since it gives no insight for topic modeling.
#Deleting the words that occur less than 4 times
highest_occurence = high_occurence[high_occurence$counter >= 4 & high_occurence$rowname != "s", ]
highest_occurence %>% arrange(counter) %>% head(10)
## rowname counter
## 1 alicia 4
## 2 america 4
## 3 american 4
## 4 around 4
## 5 author 4
## 6 bear 4
## 7 began 4
## 8 best 4
## 9 better 4
## 10 biggest 4
Let's see what the dataset looks like when we build the document-term matrix from the words whose counter is greater than 4 (a slightly stricter cut than the threshold applied above).
dtm_4_word = dplyr::filter(transposed, transposed$counter > 4) %>% select(-counter)
col_names_4 <- dtm_4_word$rowname
dtm_4_word <- dtm_4_word %>% select(-rowname) %>% t() %>% as.data.frame(.)
colnames(dtm_4_word) <- col_names_4
head(dtm_4_word, 8)[1:8]
## across age ago aliv allow also alway anim
## V1 0 0 1 0 0 0 0 0
## V2 0 0 0 0 0 0 1 0
## V3 0 0 0 0 0 0 0 0
## V4 0 0 0 0 0 0 0 0
## V5 0 0 0 0 0 0 0 0
## V6 0 0 1 0 0 1 0 0
## V7 0 0 0 0 0 0 0 0
## V8 0 0 0 0 0 0 0 0
We arrive at a dataframe of 100 observations and 274 variables. This is the starting point for the clustering process.
As mentioned above, we cannot compute the usual descriptive statistics on the initial dataset. However, after standardisation, text segmentation and frequency filtering, we can see how the final word occurrence is distributed.
summary(highest_occurence)
## rowname counter
## Length:388 Min. : 4.000
## Class :character 1st Qu.: 4.000
## Mode :character Median : 6.000
## Mean : 7.809
## 3rd Qu.: 9.000
## Max. :66.000
As we can see, the extremes are as shown above, and both the median and the mean lie much closer to the minimum than to the maximum. Since the mean is higher than the median, we are dealing with a right-skewed distribution.
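As a quick numerical confirmation of that right skew, one can compute a simple moment-based skewness coefficient in base R; this is a sketch only and assumes the highest_occurence data frame from the chunks above is still in memory.
x <- highest_occurence$counter
# third standardised moment: a clearly positive value indicates a right-skewed distribution
mean((x - mean(x))^3) / stats::sd(x)^3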
As the database was handmade, we do not need to scale the data or check for missing values. We can start by examining the clustering tendency of the data using the factoextra package. The function get_clust_tendency evaluates the clustering tendency via the Hopkins statistic and a visual assessment in the form of a dissimilarity image. Notably, the more dissimilarity there is, the better for the clustering process, as the clusters will be more sharply defined (Easy Guides, 2016). The Hopkins statistic should exceed 0.75 for the dataset to be considered clusterable; such a result indicates a clustering tendency at the 90% confidence level. The null and alternative hypotheses are as follows:
* Null hypothesis: the data set is uniformly distributed (no meaningful clusters)
* Alternative hypothesis: the data set is not uniformly distributed (contains meaningful clusters) (Data Novia, 2018).
To check the clusterability of our data, we need to transform the text into numerics. Therefore I added a new column, word_id, to be able to refer to the inputs later. For the get_clust_tendency function we will create a new data frame containing only the counter and word_id variables.
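For intuition, here is a rough, self-contained sketch of what the Hopkins statistic measures: it compares nearest-neighbour distances of artificial uniform points with those of points sampled from the data. The sampling scheme, the default m, and the convention that values near 1 indicate clusterable data are my assumptions about what get_clust_tendency does internally, so treat this only as an illustration; it can be tried on the clusterability data frame built in the next chunk.
hopkins_sketch <- function(X, m = 10, seed = 123) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  idx  <- sample(n, m)                                   # m real observations sampled from the data
  # m artificial points drawn uniformly within each column's range
  unif <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  # nearest-neighbour distance from each artificial point to the data
  u <- apply(unif, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # nearest-neighbour distance from each sampled point to the remaining data points
  w <- sapply(idx, function(i) min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  sum(u) / (sum(u) + sum(w))                             # values close to 1 suggest clusterable data
}
# e.g. hopkins_sketch(clusterability), once `clusterability` is built below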
highest_occurence$word_id <- c(1:nrow(highest_occurence))
clusterability <- select(highest_occurence, counter, word_id)
get_clust_tendency(clusterability, nrow(clusterability) - 1, graph = TRUE, gradient = list(low = "red", mid = "white", high = "blue"), seed = 123)
## $hopkins_stat
## [1] 0.8767744
##
## $plot
As we can see, the Hopkins statistic is well above 0.75, reaching 0.877, so our data appears highly clusterable. In the plot, the colour intensity is proportional to the dissimilarity between observations. Now, knowing we will apply k-means and PAM, we can build the silhouettes for both methods.
km_sil <- fviz_nbclust(dtm_4_word, kmeans, method = "silhouette") + ggtitle("K-Means")
pam_sil <- fviz_nbclust(dtm_4_word, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
grid.arrange(km_sil, pam_sil, ncol = 2, bottom = "Optimal Number of Clusters")
From the graphs we can see that for k-means the optimal number of clusters would be 3, giving the highest average silhouette width, whereas for PAM the optimal number is 2. The silhouette width ranges from -1 to 1: the higher the score, the better separated the clusters; a negative score indicates samples assigned to the wrong cluster, while a score close to 0 indicates overlapping clusters (Kumar, 2020). Because PAM's silhouette width at 2 clusters equals about 0.4, the highest of all, with quite good separation of clusters, we will proceed with 2 clusters.
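The silhouette values behind these plots can also be computed directly; the sketch below is my own illustration, with an arbitrary seed and nstart, and uses cluster::silhouette on a 2-cluster k-means fit of the filtered DTM.
set.seed(123)
km_check  <- kmeans(dtm_4_word, centers = 2, nstart = 25)
sil_check <- cluster::silhouette(km_check$cluster, dist(dtm_4_word))
mean(sil_check[, "sil_width"])                                  # overall average silhouette width
tapply(sil_check[, "sil_width"], sil_check[, "cluster"], mean)  # per-cluster averages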
Now that we have decided to proceed with 2 clusters, we can move on to the clustering process, implementing both K-Means and PAM. Because the K-Means silhouette suggested 3 clusters, I will also experiment with that scenario. We will therefore obtain plots for K-Means with 2 and 3 clusters, as well as PAM with 2 and 3 clusters.
kmclusters2 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
kmclusters3 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean", graph = FALSE)
pamclusters2 <- eclust(dtm_4_word, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
pamclusters3 <- eclust(dtm_4_word, FUNcluster = "pam", k = 3, hc_metric = "euclidean", graph = FALSE)
#Clustering with factoextra
cluster1 <- fviz_cluster(kmclusters2, geom = c("point")) + ggtitle("K-Means with 2 Clusters")
cluster2 <- fviz_cluster(kmclusters3, geom = c("point")) + ggtitle("K-Means with 3 Clusters")
cluster3 <- fviz_cluster(pamclusters2, geom = c("point")) + ggtitle("PAM with 2 Clusters")
cluster4 <- fviz_cluster(pamclusters3, geom = c("point")) + ggtitle("PAM with 3 Clusters")
grid.arrange(cluster1, cluster2, cluster3, cluster4, ncol = 2, bottom = "Clustering with K-Means and PAM")
## Follow-Up Analysis

Once the optimal number of clusters has been chosen, there is not much room for changes in the k-means analysis, so the results obtained from the fviz_nbclust function are the ones we have to work with. As we can see from the graphs, the results of the PAM algorithm are rather uninformative: we obtain one major cluster with a few insignificant residues in the remaining clusters. With the K-Means algorithm the situation looks somewhat different. Although we are dealing with overlapping clusters, which was inevitable given the low average silhouette widths (approximately 0.16 and 0.18 for the two scenarios), we have multiple clearly defined clusters, even if in the three-cluster case one of them is rather insignificant. That was to be expected, as the difference in average silhouette width between 2 and 3 clusters was small, and in the 2-cluster case we can clearly see that the second cluster "pulled in" the third cluster from the 3-cluster case. Now we can examine the average widths of all clusters for the k-means settings.
km1 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean")
km2 <- eclust(dtm_4_word, FUNcluster = "kmeans", k = 3, hc_metric = "euclidean")
sill_km1 <- fviz_silhouette(km1) + ggtitle("K-Means with 2 clusters")
## cluster size ave.sil.width
## 1 1 82 0.22
## 2 2 18 -0.11
sill_km2 <- fviz_silhouette(km2) + ggtitle("K-Means with 3 clusters")
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 20 -0.13
## 3 3 79 0.24
km1$silinfo$clus.avg.widths
## [1] 0.2158704 -0.1110941
km2$silinfo$clus.avg.widths
## [1] 0.0000000 -0.1259044 0.2352798
grid.arrange(sill_km1, sill_km2, ncol = 2 , bottom = "K-Means Cluster Widths")
As we can see from the eclust output, although the clustering was carried out and the observations were divided, the resulting partition is rather poor. In the 2-cluster K-Means case, the second cluster contains fewer than 10 observations that do not overlap with cluster 1, while the rest belong to the overlapping region or to cluster 1 only; a substantial majority, 82 of the 100 observations, is allocated to cluster 1. Thus we do not really have a successful clustering example; what we obtain is closer to a set of labelled outliers, which is not the anticipated result. In the 3-cluster K-Means case the situation is marginally better, as more observations belong fully to cluster 2. Here, however, we also get a single-observation cluster: cluster 1 consists of the input V2 alone, which is so distant from all the other observations that it forces an additional cluster of its own. Cluster 3 is very similar to cluster 1 from the previous K-Means run, holding the majority of the values, here 79 of them, some shared with cluster 2. Both silhouette width plots for K-Means contain negative values, a consequence of the significant overlap between clusters. In an ideal clustering situation we would want well-defined, well-separated clusters, so let us check the PAM widths, in case a more suitable algorithm application was missed.
km3 <- eclust(dtm_4_word, FUNcluster = "pam", k = 2, hc_metric = "euclidean")
km4 <- eclust(dtm_4_word, FUNcluster = "pam", k = 3, hc_metric = "euclidean")
sill_km3 <- fviz_silhouette(km3) + ggtitle("PAM with 2 clusters")
## cluster size ave.sil.width
## 1 1 99 0.41
## 2 2 1 0.00
sill_km4 <- fviz_silhouette(km4) + ggtitle("PAM with 3 clusters")
## cluster size ave.sil.width
## 1 1 98 0.32
## 2 2 1 0.00
## 3 3 1 0.00
km3$silinfo$clus.avg.widths
## [1] 0.4089669 0.0000000
km4$silinfo$clus.avg.widths
## [1] 0.3189033 0.0000000 0.0000000
grid.arrange(sill_km3, sill_km4, ncol = 1, bottom = "PAM Cluster Widths")
Here we also face a conundrum, though of a different kind than in the K-Means clustering. On the one hand, we are not dealing with negative cluster widths. On the other hand, the main cluster is so dominant that the remaining clusters might as well not exist, in both PAM scenarios. Such output might indicate poor clusterability, especially in the case of the PAM algorithm. To examine clusterability further, let's conduct a Duda-Hart test to see whether the data should be split into 2 clusters (one of the scenarios).
book_dist <- dist(dtm_4_word)
hclust_2 <- cutree(hclust(book_dist), 2)
dudahart2(dtm_4_word, hclust_2)
## $p.value
## [1] 8.739347e-07
##
## $dh
## [1] 0.9568937
##
## $compare
## [1] 0.971314
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
For the 2-cluster split we obtained FALSE for cluster1, with dh lower than compare, and a p-value of 8.74e-07, far below the significance level (alpha = 0.001). We therefore reject the null hypothesis of a single homogeneous cluster: formally, the test supports splitting the data into 2 clusters, in line with the Hopkins statistic. This leaves us with a dilemma. Although the initial database consisted of 8 different genres and both formal tests point to some structure, the partitions actually produced from the description variable were dominated by one large cluster and were of little practical use. The common saying that "less is more" does not apply here. After preprocessing, we should conduct at least two checks of clusterability and also inspect the resulting partitions themselves; one validation is not enough, and more evidence is necessary for an optimal and objective answer.
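As one further, purely visual check (a sketch of my own, using the already loaded dendextend package and the same Euclidean distances computed above; the Ward linkage is an arbitrary choice), we can look at the hierarchical clustering dendrogram and see whether any natural split emerges. If the dendrogram also fails to show two well-separated branches, that corroborates the weak partitions seen above.
hc_ward <- hclust(book_dist, method = "ward.D2")   # book_dist was computed for the Duda-Hart test
dend <- color_branches(as.dendrogram(hc_ward), k = 2)
plot(dend, main = "Ward hierarchical clustering of the book descriptions")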
Since the clustering with both K-Means and PAM algorithms turned out to be weak, despite the formal indications of clusterability, there is no point in examining the clusters individually.
Hence we will move on to LDA (Latent Dirichlet Allocation), to examine whether this method gives more sensible results. First we need to set the number of topics, which will be the number of genres, 8. LDA returns, for each input, the probabilities of allocation to each of the given topics (Kulshrestha, 2019).
#Number of topics
k <- 8
#LDA by genre number
lda1 <- LDA(dtm_4_word, k = k)
bookstopics <- as.data.frame(tidy(lda1, matrix = "beta"))
head(bookstopics, 10)
## topic term beta
## 1 1 across 0.0009025847
## 2 2 across 0.0005633805
## 3 3 across 0.0027279429
## 4 4 across 0.0005157646
## 5 5 across 0.0018480227
## 6 6 across 0.0011686842
## 7 7 across 0.0055191436
## 8 8 across 0.0020916216
## 9 1 age 0.0030153885
## 10 2 age 0.0002469744
#LDA by proposed cluster number
#2 clusters:
lda2 <- LDA(dtm_4_word, k = 2)
topics2 <- as.data.frame(tidy(lda2, matrix = "beta"))
head(topics2, 10)
## topic term beta
## 1 1 across 0.0019469616
## 2 2 across 0.0018817092
## 3 1 age 0.0054276052
## 4 2 age 0.0007114735
## 5 1 ago 0.0018673066
## 6 2 ago 0.0034879303
## 7 1 aliv 0.0029245319
## 8 2 aliv 0.0009097213
## 9 1 allow 0.0013682999
## 10 2 allow 0.0032205769
#3 clusters:
lda3 <- LDA(dtm_4_word, k = 3)
topics3 <- as.data.frame(tidy(lda3, matrix = "beta"))
head(topics3, 10)
## topic term beta
## 1 1 across 0.0015336294
## 2 2 across 0.0010727041
## 3 3 across 0.0031445971
## 4 1 age 0.0037469512
## 5 2 age 0.0004132162
## 6 3 age 0.0050500203
## 7 1 ago 0.0021376492
## 8 2 ago 0.0040160854
## 9 3 ago 0.0018754947
## 10 1 aliv 0.0034714877
The tidy function turned our model into a one-topic-per-term-per-row format. For every combination, the model gives the probability of that term being generated from that topic (Silge & Robinson, 2020). For example, for the word "age" in the output above, the probability of it being generated from topic 1 is about 0.0030, but only about 0.00025 for topic 2 (LDA is stochastic, so these values change between runs unless a seed is set). Now let's find the 10 terms that are most common within each topic for every LDA fit.
bookstopics_top <- bookstopics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph1 <- bookstopics_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph1
topics2_top <- topics2 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph2 <- topics2_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph2
topics3_top <- topics3 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
graph3 <- topics3_top %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
graph3
This visualisation lets us see the topics extracted from the books' descriptions and the importance of particular words in each topic. In the 8-topic LDA we mostly see different themes within rather similar topics. In contrast, the 3-topic LDA shows a more differentiated division. In the first topic, the prevailing words are "life", "new", "find", "perfect", "love" and "friend", which suggests lighter literature, possibly romance or contemporary fiction. In the second topic we also see words of a similar tone, such as "life", "year", "girl" and "love", but here we additionally find "murder" and "house". Although on its own the word "house" does not imply grim themes, paired with "murder" it reads as such; here we are probably dealing with mysteries, crime, thrillers and horror. In the last group we are missing any words with an emotional tone, such as "love", "perfect" or "friend", which suggests that non-fiction pieces were collected in this topic, possibly with some historical fiction, which is often less emotionally explicit in its descriptions. Clearly, the 3-topic LDA is the one with well-defined topics and clear inter-topic variation, so the further modeling and estimation will be conducted on that example.
In the foregoing LDA implementation we estimated each topic as a mixture of words. Notably, Latent Dirichlet Allocation can also model each input (document) as a mixture of topics, via the per-document-per-topic probability "gamma".
#3 clusters proposal:
lda3 <- LDA(dtm_4_word, k = 3)
gamma_topics3 <- as.data.frame(tidy(lda3, matrix = "gamma"))
head(gamma_topics3, 10)
## document topic gamma
## 1 V1 1 0.3215404
## 2 V2 1 0.2988681
## 3 V3 1 0.3293738
## 4 V4 1 0.3348483
## 5 V5 1 0.3475081
## 6 V6 1 0.3366960
## 7 V7 1 0.3325619
## 8 V8 1 0.3259601
## 9 V9 1 0.3233657
## 10 V10 1 0.3321911
As we can see, the probabilities of the three topics within each book description are nearly identical, all close to 1/3. This shows that although particular words characterise the topics, each description is modelled as a fairly even blend of all of them, so individual books are not strongly associated with a single topic.
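To quantify how blended the mixtures are, a short sketch (assuming gamma_topics3 from the chunk above and dplyr >= 1.0 for slice_max) can keep only each book's highest-probability topic and summarise it.
top_topic <- gamma_topics3 %>%
  dplyr::group_by(document) %>%
  dplyr::slice_max(gamma, n = 1) %>%
  dplyr::ungroup()
summary(top_topic$gamma)   # maxima barely above 1/3 confirm how evenly the topics are mixed
table(top_topic$topic)     # how many books each topic would claim if we forced an assignment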
In this project we have attempted to cluster text data. Although clustering is most commonly applied to numerical data, or to data transformed accordingly, I decided to follow the less common path. When analysing a text-based dataset, preprocessing is of the utmost importance, hence the text standardisation and segmentation needed to make the data ready for cluster allocation. However, regardless of the diligent preparation of the dataset and the necessary modifications, the results from K-Means and PAM were not of high quality: the partitions were dominated by a single large cluster. We therefore turned to LDA modeling, which gave a clearer topic division and performed better than the standard clustering measures, although the per-document-per-topic probabilities turned out to be nearly identical in most cases. The most important remark is that even when particular words define the topics noticeably, the heavy mixing within documents weakens conclusions about individual inputs. It is clear, having applied multiple measures, that to obtain the best possible results several separate methods have to be used, together with repeated verification of the clustering tendency and of the partitions themselves. Our dataset turned out to be a rather poor example for the clustering process. Nonetheless, it is an instructive case of poor practical clusterability, emphasising the importance of a well-suited dataset for future studies, and an interesting precedent showing that even when clustering does not work, there are alternatives such as LDA.
Data Novia. (2018). Assessing Clustering Tendency. Retrieved from https://www.datanovia.com/en/lessons/assessing-clustering-tendency/
Easy Guides. (2016). Assessing clustering tendency: A vital issue – Unsupervised Machine Learning. Retrieved from https://www.r-bloggers.com/2016/10/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning/
Ibrahim, L. F., & Harbi, M. H. A. (2013). Using modified partitioning around medoids clustering technique in mobile network planning. Retrieved from https://arxiv.org/pdf/1302.6602.pdf
Kulshrestha, R. (2019). A Beginner’s Guide to Latent Dirichlet Allocation (LDA). Retrieved from https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
Kumar, A. (2020). KMeans Silhouette Score Explained With Python Example. Retrieved from https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam
Munoz, E. (2020). Getting started with NLP: Tokenization, Document-Term Matrix, TF-IDF. Retrieved from https://medium.com/analytics-vidhya/getting-started-with-nlp-tokenization-document-term-matrix-tf-idf-2ea7d01f1942
Ponte, J. M., & Croft, W. B. (1997). Text segmentation by topic. In International Conference on Theory and Practice of Digital Libraries, 113-125. Springer, Berlin, Heidelberg. Retrieved from https://link.springer.com/chapter/10.1007/BFb0026725
Shaheen, M., Shahbaz, M., Guergachi, A., & Rehman, Z. (2011). Mining sustainability indicators to classify hydrocarbon development. Knowledge-Based Systems, 24(8), 1159-1168. Retrieved from https://www.sciencedirect.com/science/article/abs/pii/S0950705111000827
Silge, J., & Robinson, D. (2020). “Text Mining with R: A Tidy Approach”. 6 Topic modeling. Retrieved from https://www.tidytextmining.com/topicmodeling.html
Zwitch, R. (2013). Clustering Search Keywords Using K-Means Clustering. Retrieved from https://www.r-bloggers.com/2013/09/clustering-search-keywords-using-k-means-clustering/