Topic Modelling (LDA)

Lets first load the required libraries.

library(topicmodels)
library(tm)

## Loading required package: NLP

library(SnowballC)
library(wordcloud)

## Loading required package: RColorBrewer

library(RColorBrewer)
library(syuzhet)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidytext)
library(forcats)

Introduction

Huge collections of data, be it literature, emails, web pages or content in other digital media have grown to sizes that make it hard for humans to handle them easily. It is the computer sciences of information retrieval, text mining and natural language processing that deal with this problem. Researchers from these fields have developed the techniques that help us transform our search queries (on Google or Bing) into meaningful search results, recommend to us new books based on our previous purchases (Amazon) or automatically translate documents between languages. A new approach to extracting information from digital data are so-called topic models. As the name suggests, they are about modeling which data entities are likely to be from the same topic or subject. Topic model algorithms enable us to feed a computer program with data and have documents or other content (images or videos) assigned to topics that are meaningful to humans. The topics are the results of an unsupervised learning process which can then be used for searching or browsing the original data collection.

In this paper, I focus on the topic model latent Dirichlet allocation (Lda), which was first proposed by Blei et al. in 2003. In comparison to other topic models, LDA has the advantage of being a probabilistic model that firstly performs better than alternatives such as probabilistic latent semantic indexing (Plsi) (Blei et al., 2003) and that secondly, as a Bayesian network, is easier to extend to more specific purposes. Variations on the original LDA have led to topic models such as correlated topic models (Ctm), authortopic models (Atm) and hierarchical topic models (Htm), all of which make different assumptions about the data and with each being suited for specific analysis.

Purpose

The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts. Topic modeling provides us with methods to organize, understand and summarize large collections of textual information . It helps in discovering hidden topical patterns that are present across the collection and annotating documents according to these topics

Description of data set

lets read our csv file with the help of read.csv function. Additionally lets look at our data by using the glimpse function.

data <- read.csv("kindle_review.csv")
glimpse(data)

## Rows: 2,672
## Columns: 11
## $ X              <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Unnamed..0     <int> 11539, 5957, 9146, 7038, 1776, 3744, 13641, 4448, 2797,…
## $ asin           <chr> "B0033UV8HI", "B002HJV4DE", "B002ZG96I4", "B002QHWOEU",…
## $ helpful        <chr> "[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]", "[0, 1]", "[6,…
## $ rating         <int> 3, 5, 3, 3, 4, 5, 2, 4, 5, 4, 1, 4, 1, 4, 5, 2, 4, 1, 5…
## $ reviewText     <chr> "Jace Rankin may be short, but he's nothing to mess wit…
## $ reviewTime     <chr> "09 2, 2010", "10 8, 2013", "04 11, 2014", "07 5, 2014"…
## $ reviewerID     <chr> "A3HHXRELK8BHQG", "A2RGNZ0TRF578I", "A3S0H2HV6U1I7F", "…
## $ reviewerName   <chr> "Ridley", "Holly Butler", "Merissa", "Cleargrace", "Rjo…
## $ summary        <chr> "Entertaining But Average", "Terrific menage scenes!", …
## $ unixReviewTime <int> 1283385600, 1381190400, 1397174400, 1404518400, 1356912…

From the above output we can see that we have 11 columns and 2672 observations. Columns are self explanatory. We don’t require all columns for our analysis so its better to either remove them or select only the desired columns which is in our case is:

Review Text. It contains various reviews by different customers. We will analyse only reviews to identify hidden patterns.

Cleaning Preparation of Data for Modeling

glimpse(data)

## Rows: 2,672
## Columns: 11
## $ X              <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Unnamed..0     <int> 11539, 5957, 9146, 7038, 1776, 3744, 13641, 4448, 2797,…
## $ asin           <chr> "B0033UV8HI", "B002HJV4DE", "B002ZG96I4", "B002QHWOEU",…
## $ helpful        <chr> "[8, 10]", "[1, 1]", "[0, 0]", "[1, 3]", "[0, 1]", "[6,…
## $ rating         <int> 3, 5, 3, 3, 4, 5, 2, 4, 5, 4, 1, 4, 1, 4, 5, 2, 4, 1, 5…
## $ reviewText     <chr> "Jace Rankin may be short, but he's nothing to mess wit…
## $ reviewTime     <chr> "09 2, 2010", "10 8, 2013", "04 11, 2014", "07 5, 2014"…
## $ reviewerID     <chr> "A3HHXRELK8BHQG", "A2RGNZ0TRF578I", "A3S0H2HV6U1I7F", "…
## $ reviewerName   <chr> "Ridley", "Holly Butler", "Merissa", "Cleargrace", "Rjo…
## $ summary        <chr> "Entertaining But Average", "Terrific menage scenes!", …
## $ unixReviewTime <int> 1283385600, 1381190400, 1397174400, 1404518400, 1356912…

Prior to any analysis first step is to clean and transform data according to model requirements. Usually in text data we have some special characters, white spaces or stop words. Its recommended to get rid of these things as they don’t add any value to analysis.

# using unnest_tokens()
data_review <- data %>%
  unnest_tokens(word, reviewText) # What to create (word) from where (reviewText) 
head(data_review)

##   X Unnamed..0       asin helpful rating reviewTime     reviewerID reviewerName
## 1 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
## 2 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
## 3 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
## 4 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
## 5 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
## 6 0      11539 B0033UV8HI [8, 10]      3 09 2, 2010 A3HHXRELK8BHQG       Ridley
##                    summary unixReviewTime   word
## 1 Entertaining But Average     1283385600   jace
## 2 Entertaining But Average     1283385600 rankin
## 3 Entertaining But Average     1283385600    may
## 4 Entertaining But Average     1283385600     be
## 5 Entertaining But Average     1283385600  short
## 6 Entertaining But Average     1283385600    but

unnest_tokens() has done some cleaning removed punctuation and white space, transformed to lowercase etc. We can see in the above output that each row contains only one word. Means one word per row. Its a really large data set we can have a look at the total amount of rows by using the dim function.

dim(data_review)

## [1] 285930     11

In total we have 285930 rows. Lets count the words and arrange them in descending order to see which words occur more frequently.

# counting words
data_review %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head()

##   word     n
## 1  the 14480
## 2  and  8360
## 3   to  7844
## 4    a  7842
## 5    i  7225
## 6   of  5891

We can still see some common words such as the, and, to a etc occurred more frequently.To analyze someone’s distinctive word use, you want to remove these words. That can be done with an anti_join to tidy text’s list of stop_words.

# using unnest_tokens() with stopwords
data_review2 <- data %>%
  unnest_tokens(word, reviewText) %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

#head(data_review2)

lets count the words again to see if we manage to resolve the issue successfully.

# counting words again
data_review2 %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head()

##         word    n
## 1       book 2779
## 2      story 2066
## 3       read 1749
## 4 characters  909
## 5       love  787
## 6      books  667

We can see from the above output that now the most frequent words are book, story and read etc which actually reflect the actual and meaningful content.

Visualization

Instead of looking at the data frame its more appealing, attractive and understandable to visualize our cleaned data. Lets count the words and arrange them in descending order.

word_counts <- data_review2 %>%
  count(word) %>%
  filter(n>300) %>%
  arrange(desc(n))

Now lets pass this word_count to ggplot function and flip the axis to see frequency of words.

# using coord_flip()
# when data are hard to read
# on the x axis
ggplot(word_counts, aes(x=word, y=n)) + 
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")

We can see in the above output each word against its count. Lets reorder them and then visualize them from largest to smallest.

# reorder what (word) by what (n)
word_counts <- data_review2 %>%
  count(word) %>%
  filter(n>300) %>%
  mutate(word2 = fct_reorder(word, n))

word_counts

##          word    n      word2
## 1      author  652     author
## 2        book 2779       book
## 3       books  667      books
## 4   character  335  character
## 5  characters  909 characters
## 6     enjoyed  353    enjoyed
## 7      kindle  306     kindle
## 8        love  787       love
## 9        plot  344       plot
## 10       read 1749       read
## 11    reading  582    reading
## 12    romance  367    romance
## 13     series  609     series
## 14        sex  474        sex
## 15      short  493      short
## 16    stories  372    stories
## 17      story 2066      story
## 18       time  600       time
## 19    writing  301    writing
## 20    written  309    written

# now this plot
# with new ordered column x = word2
# is arranged by word count
# and is far better to read:
ggplot(word_counts, aes(x=word2, y=n)) + 
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")

Topic Modeling (LDA)

After the cleaning and visualization next step is to perform Latent Dirichlet Allocation. Its an iterative algorithm that uncover topics based on word frequency, which is discrete. The intuition behind LDA is that documents usually refer to a small number of topics and topics usually based on small number of words. But first we need to create a Document Term Matrix. A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

# using as.matrix()
dtm_review <- data_review2 %>%
  count(X, word) %>%  # count each word used in each identified review 
  cast_dtm(X, word, n) %>%  # use the word counts by reviews  to create a DTM
  as.matrix()

After creating the dtm_review now lets finally perform and implement LDA.In the following code k is the number of topics we want the model to produce. Specifying the simulation seed will help us recover consistent topics.

lda_out <- LDA(
  dtm_review,
  k = 2,
  method = "Gibbs",
  control = list(seed=42)
)

We can use the glimpse function to see whats included in the lda_out object.

glimpse(lda_out)

## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:99239] 2 2 2 1 1 1 2 1 2 1 ...
##   ..@ alpha          : num 25
##   ..@ call           : language LDA(x = dtm_review, k = 2, method = "Gibbs", control = list(seed = 42))
##   ..@ Dim            : int [1:2] 2672 16843
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   ..@ k              : int 2
##   ..@ terms          : chr [1:16843] "1890s" "add" "adult" "agrees" ...
##   ..@ documents      : chr [1:2672] "0" "1" "2" "3" ...
##   ..@ beta           : num [1:2, 1:16843] -13.17 -10.73 -9.45 -7.21 -13.17 ...
##   ..@ gamma          : num [1:2672, 1:2] 0.609 0.493 0.475 0.595 0.482 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:84134] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ j   : int [1:84134] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ v   : num [1:84134] 2 2 2 1 1 1 2 1 2 1 ...
##   .. ..$ nrow: int 2672
##   .. ..$ ncol: int 16843
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -791192
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 99239

Its a really long object. However k - the number of topics that we specified and beta - the word probabilities to define the topics. Let’s evaluate LDA model output the most important LDA model output are the topics themselves i.e. a dictionary of all words in the corpus sorted according to the probability that each word occurs as a part of that topic.function tidy() takes the matrix of topic probabilities “beta”and put it into a form that is easy visualized using ggplot2

lda_topics <- lda_out %>%
  tidy(matrix = "beta")

lda_topics %>%
  arrange(desc(beta))

## # A tibble: 33,686 × 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     2 book       0.0552 
##  2     2 read       0.0346 
##  3     2 story      0.0345 
##  4     2 characters 0.0174 
##  5     1 love       0.0150 
##  6     2 books      0.0133 
##  7     2 author     0.0129 
##  8     1 series     0.0116 
##  9     2 reading    0.0111 
## 10     2 short      0.00979
## # ℹ 33,676 more rows

The above output only shows with what the topics are composed off but no direction of what topics mean. Key direction is to find topics that are each different where no topic repeats. So lets perform the modeling again

#let's do once again topic modeling with LDA()
# and gather all code together: and then
#Finally let's plot discovered topics
# we treat topic as a factor
# to add some color
lda_topics <- LDA(
  dtm_review,
  k = 2,
  method = "Gibbs",
  control = list(seed=42)
  ) %>%
  tidy(matrix = "beta")
word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
ggplot(
  word_probs,
  aes(term2, beta, fill=as.factor(topic))
  ) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Interpretation of topics is quite subjective in nature. The first topic (topic 1) seems to be a collection of words describing some sort of a romantic story and the second topic (topic 2) is more related to book reading and story telling. Based on the intuition we can conclude that topic 1 could be titled as “Women as Romantic Creatures” and Topic 2 could be labeled as”Story Reading & Writing”.

Evaluation

Lets evaluate our LDA and try out of we can split our text into three distinct topics.

# Three topics----
# we will repeat the same steps
# of modeling LDA, tidying, grouping, reordering, and finally ploting
# but with k=3 topics

lda_topics2 <- LDA(
  dtm_review,
  k = 3,
  method = "Gibbs",
  control = list(seed=42)
) %>%
  tidy(matrix = "beta")
word_probs2 <- lda_topics2 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
ggplot(
  word_probs2,
  aes(term2, beta, fill=as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

In the above figures we see that topic 1 contains words which are quite different from topic 2 and topic 3. Additionally, words collection and combination of topic 2 and topic 3 are quite similar .Note that adding topics that are different is good but if we start repeating topics, we have gone too far name the topics based on the combination of high-probability words. And this is contrary to topic 2 and topic 3. Therefore for this analysis we will stick to only two topics as already described and titled above.

Summary

The objective of this analysis was to perform LDA. After transforming and cleaning of our data set, we identified that two topics could possibly describe our data: Women as Romantic Creatures and Story Reading and Writing. Words are categorized and labeled under two distinct topics is solely based on probability of occurrence or in simple words high-probability words. The evaluation of LDA with 3 topics made no sense as repeating topics does not add anything except just naming them differently.

Topic Modelling (LDA)

Muhammad Usman PhD Scholar University of Warsaw, Poland

01/02/2025