Lets first load the required libraries.
library(topicmodels)
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(forcats)
Introduction
Huge collections of data, be it literature, emails, web pages or content in other digital media have grown to sizes that make it hard for humans to handle them easily. It is the computer sciences of information retrieval, text mining and natural language processing that deal with this problem. Researchers from these fields have developed the techniques that help us transform our search queries (on Google or Bing) into meaningful search results, recommend to us new books based on our previous purchases (Amazon) or automatically translate documents between languages. A new approach to extracting information from digital data are so-called topic models. As the name suggests, they are about modeling which data entities are likely to be from the same topic or subject. Topic model algorithms enable us to feed a computer program with data and have documents or other content (images or videos) assigned to topics that are meaningful to humans. The topics are the results of an unsupervised learning process which can then be used for searching or browsing the original data collection.
In this paper, I focus on the topic model latent Dirichlet allocation (Lda), which was first proposed by Blei et al. in 2003. In comparison to other topic models, LDA has the advantage of being a probabilistic model that firstly performs better than alternatives such as probabilistic latent semantic indexing (Plsi) (Blei et al., 2003) and that secondly, as a Bayesian network, is easier to extend to more specific purposes. Variations on the original LDA have led to topic models such as correlated topic models (Ctm), authortopic models (Atm) and hierarchical topic models (Htm), all of which make different assumptions about the data and with each being suited for specific analysis.
Purpose
The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts. Topic modeling provides us with methods to organize, understand and summarize large collections of textual information . It helps in discovering hidden topical patterns that are present across the collection and annotating documents according to these topics
Description of data set
lets read our csv file with the help of read.csv function. Additionally lets look at our data by using the glimpse function.
data <- read.csv("kindle_review.csv")
glimpse(data)
## Rows: 2,672
## Columns: 11
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Unnamed..0 <int> 11539, 5957, 9146, 7038, 1776, 3744, 13641, 4448, 2797,…
## $ asin <chr> "B0033UV8HI", "B002HJV4DE", "B002ZG96I4", "B002QHWOEU",…
## $ helpful <chr> "[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]", "[0, 1]", "[6,…
## $ rating <int> 3, 5, 3, 3, 4, 5, 2, 4, 5, 4, 1, 4, 1, 4, 5, 2, 4, 1, 5…
## $ reviewText <chr> "Jace Rankin may be short, but he's nothing to mess wit…
## $ reviewTime <chr> "09 2, 2010", "10 8, 2013", "04 11, 2014", "07 5, 2014"…
## $ reviewerID <chr> "A3HHXRELK8BHQG", "A2RGNZ0TRF578I", "A3S0H2HV6U1I7F", "…
## $ reviewerName <chr> "Ridley", "Holly Butler", "Merissa", "Cleargrace", "Rjo…
## $ summary <chr> "Entertaining But Average", "Terrific menage scenes!", …
## $ unixReviewTime <int> 1283385600, 1381190400, 1397174400, 1404518400, 1356912…
From the above output we can see that we have 11 columns and 2672 observations. Columns are self explanatory. We don’t require all columns for our analysis so its better to either remove them or select only the desired columns which is in our case is:
Cleaning Preparation of Data for Modeling
glimpse(data)
## Rows: 2,672
## Columns: 11
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Unnamed..0 <int> 11539, 5957, 9146, 7038, 1776, 3744, 13641, 4448, 2797,…
## $ asin <chr> "B0033UV8HI", "B002HJV4DE", "B002ZG96I4", "B002QHWOEU",…
## $ helpful <chr> "[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]", "[0, 1]", "[6,…
## $ rating <int> 3, 5, 3, 3, 4, 5, 2, 4, 5, 4, 1, 4, 1, 4, 5, 2, 4, 1, 5…
## $ reviewText <chr> "Jace Rankin may be short, but he's nothing to mess wit…
## $ reviewTime <chr> "09 2, 2010", "10 8, 2013", "04 11, 2014", "07 5, 2014"…
## $ reviewerID <chr> "A3HHXRELK8BHQG", "A2RGNZ0TRF578I", "A3S0H2HV6U1I7F", "…
## $ reviewerName <chr> "Ridley", "Holly Butler", "Merissa", "Cleargrace", "Rjo…
## $ summary <chr> "Entertaining But Average", "Terrific menage scenes!", …
## $ unixReviewTime <int> 1283385600, 1381190400, 1397174400, 1404518400, 1356912…
Prior to any analysis first step is to clean and transform data according to model requirements. Usually in text data we have some special characters, white spaces or stop words. Its recommended to get rid of these things as they don’t add any value to analysis.
# using unnest_tokens()
data_review <- data %>%
unnest_tokens(word, reviewText) # What to create (word) from where (reviewText)
head(data_review)
## X Unnamed..0 asin helpful rating reviewTime reviewerID reviewerName
## 1 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## 2 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## 3 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## 4 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## 5 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## 6 0 11539 B0033UV8HI [8, 10] 3 09 2, 2010 A3HHXRELK8BHQG Ridley
## summary unixReviewTime word
## 1 Entertaining But Average 1283385600 jace
## 2 Entertaining But Average 1283385600 rankin
## 3 Entertaining But Average 1283385600 may
## 4 Entertaining But Average 1283385600 be
## 5 Entertaining But Average 1283385600 short
## 6 Entertaining But Average 1283385600 but
unnest_tokens() has done some cleaning removed punctuation and white space, transformed to lowercase etc. We can see in the above output that each row contains only one word. Means one word per row. Its a really large data set we can have a look at the total amount of rows by using the dim function.
dim(data_review)
## [1] 285930 11
In total we have 285930 rows. Lets count the words and arrange them in descending order to see which words occur more frequently.
# counting words
data_review %>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 the 14480
## 2 and 8360
## 3 to 7844
## 4 a 7842
## 5 i 7225
## 6 of 5891
We can still see some common words such as the, and, to a etc occurred more frequently.To analyze someone’s distinctive word use, you want to remove these words. That can be done with an anti_join to tidy text’s list of stop_words.
# using unnest_tokens() with stopwords
data_review2 <- data %>%
unnest_tokens(word, reviewText) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
#head(data_review2)
lets count the words again to see if we manage to resolve the issue successfully.
# counting words again
data_review2 %>%
count(word) %>%
arrange(desc(n)) %>%
head()
## word n
## 1 book 2779
## 2 story 2066
## 3 read 1749
## 4 characters 909
## 5 love 787
## 6 books 667
We can see from the above output that now the most frequent words are book, story and read etc which actually reflect the actual and meaningful content.
Visualization
Instead of looking at the data frame its more appealing, attractive and understandable to visualize our cleaned data. Lets count the words and arrange them in descending order.
word_counts <- data_review2 %>%
count(word) %>%
filter(n>300) %>%
arrange(desc(n))
Now lets pass this word_count to ggplot function and flip the axis to see frequency of words.
# using coord_flip()
# when data are hard to read
# on the x axis
ggplot(word_counts, aes(x=word, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
We can see in the above output each word against its count. Lets reorder them and then visualize them from largest to smallest.
# reorder what (word) by what (n)
word_counts <- data_review2 %>%
count(word) %>%
filter(n>300) %>%
mutate(word2 = fct_reorder(word, n))
word_counts
## word n word2
## 1 author 652 author
## 2 book 2779 book
## 3 books 667 books
## 4 character 335 character
## 5 characters 909 characters
## 6 enjoyed 353 enjoyed
## 7 kindle 306 kindle
## 8 love 787 love
## 9 plot 344 plot
## 10 read 1749 read
## 11 reading 582 reading
## 12 romance 367 romance
## 13 series 609 series
## 14 sex 474 sex
## 15 short 493 short
## 16 stories 372 stories
## 17 story 2066 story
## 18 time 600 time
## 19 writing 301 writing
## 20 written 309 written
# now this plot
# with new ordered column x = word2
# is arranged by word count
# and is far better to read:
ggplot(word_counts, aes(x=word2, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Review Word Counts")
Topic Modeling (LDA)
After the cleaning and visualization next step is to perform Latent Dirichlet Allocation. Its an iterative algorithm that uncover topics based on word frequency, which is discrete. The intuition behind LDA is that documents usually refer to a small number of topics and topics usually based on small number of words. But first we need to create a Document Term Matrix. A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
# using as.matrix()
dtm_review <- data_review2 %>%
count(X, word) %>% # count each word used in each identified review
cast_dtm(X, word, n) %>% # use the word counts by reviews to create a DTM
as.matrix()
After creating the dtm_review now lets finally perform and implement LDA.In the following code k is the number of topics we want the model to produce. Specifying the simulation seed will help us recover consistent topics.
lda_out <- LDA(
dtm_review,
k = 2,
method = "Gibbs",
control = list(seed=42)
)
We can use the glimpse function to see whats included in the lda_out object.
glimpse(lda_out)
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:99239] 2 2 2 1 1 1 2 1 2 1 ...
## ..@ alpha : num 25
## ..@ call : language LDA(x = dtm_review, k = 2, method = "Gibbs", control = list(seed = 42))
## ..@ Dim : int [1:2] 2672 16843
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## ..@ k : int 2
## ..@ terms : chr [1:16843] "1890s" "add" "adult" "agrees" ...
## ..@ documents : chr [1:2672] "0" "1" "2" "3" ...
## ..@ beta : num [1:2, 1:16843] -13.17 -10.73 -9.45 -7.21 -13.17 ...
## ..@ gamma : num [1:2672, 1:2] 0.609 0.493 0.475 0.595 0.482 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:84134] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ j : int [1:84134] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ v : num [1:84134] 2 2 2 1 1 1 2 1 2 1 ...
## .. ..$ nrow: int 2672
## .. ..$ ncol: int 16843
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -791192
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 99239
Its a really long object. However k - the number of topics that we specified and beta - the word probabilities to define the topics. Let’s evaluate LDA model output the most important LDA model output are the topics themselves i.e. a dictionary of all words in the corpus sorted according to the probability that each word occurs as a part of that topic.function tidy() takes the matrix of topic probabilities “beta”and put it into a form that is easy visualized using ggplot2
lda_topics <- lda_out %>%
tidy(matrix = "beta")
lda_topics %>%
arrange(desc(beta))
## # A tibble: 33,686 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 2 book 0.0552
## 2 2 read 0.0346
## 3 2 story 0.0345
## 4 2 characters 0.0174
## 5 1 love 0.0150
## 6 2 books 0.0133
## 7 2 author 0.0129
## 8 1 series 0.0116
## 9 2 reading 0.0111
## 10 2 short 0.00979
## # ℹ 33,676 more rows
The above output only shows with what the topics are composed off but no direction of what topics mean. Key direction is to find topics that are each different where no topic repeats. So lets perform the modeling again
#let's do once again topic modeling with LDA()
# and gather all code together: and then
#Finally let's plot discovered topics
# we treat topic as a factor
# to add some color
lda_topics <- LDA(
dtm_review,
k = 2,
method = "Gibbs",
control = list(seed=42)
) %>%
tidy(matrix = "beta")
word_probs <- lda_topics %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
mutate(term2 = fct_reorder(term, beta))
ggplot(
word_probs,
aes(term2, beta, fill=as.factor(topic))
) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Interpretation of topics is quite subjective in nature. The first topic (topic 1) seems to be a collection of words describing some sort of a romantic story and the second topic (topic 2) is more related to book reading and story telling. Based on the intuition we can conclude that topic 1 could be titled as “Women as Romantic Creatures” and Topic 2 could be labeled as”Story Reading & Writing”.
Evaluation
Lets evaluate our LDA and try out of we can split our text into three distinct topics.
# Three topics----
# we will repeat the same steps
# of modeling LDA, tidying, grouping, reordering, and finally ploting
# but with k=3 topics
lda_topics2 <- LDA(
dtm_review,
k = 3,
method = "Gibbs",
control = list(seed=42)
) %>%
tidy(matrix = "beta")
word_probs2 <- lda_topics2 %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
mutate(term2 = fct_reorder(term, beta))
ggplot(
word_probs2,
aes(term2, beta, fill=as.factor(topic))
) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
In the above figures we see that topic 1 contains words which are quite different from topic 2 and topic 3. Additionally, words collection and combination of topic 2 and topic 3 are quite similar .Note that adding topics that are different is good but if we start repeating topics, we have gone too far name the topics based on the combination of high-probability words. And this is contrary to topic 2 and topic 3. Therefore for this analysis we will stick to only two topics as already described and titled above.
Summary
The objective of this analysis was to perform LDA. After transforming and cleaning of our data set, we identified that two topics could possibly describe our data: Women as Romantic Creatures and Story Reading and Writing. Words are categorized and labeled under two distinct topics is solely based on probability of occurrence or in simple words high-probability words. The evaluation of LDA with 3 topics made no sense as repeating topics does not add anything except just naming them differently.