Topic Modeling and LDA

In analyzing text data, we attempt to derive meaning from a collection of documents, often by statistically separating those documents into groups based on their characteristics. One popular way to do this is topic modeling: an unsupervised classification method for extracting topics from a collection of documents, where the topics or groups are unobserved. The statistical method driving the assignment of documents to topics is a generative model widely used in natural language processing known as Latent Dirichlet Allocation, or LDA.

The driving principles behind the LDA algorithm are as follows:

1. Every document can be represented as a mixture of topics

2. Every topic can be represented as a mixture of words

Leveraging these principles, we can estimate the likelihood that each document belongs to a given topic based on the mixture of words it contains.
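
As a toy illustration of these principles (the topics, words, and probabilities below are made up, not fitted from any data), we can write out the generative story directly: draw a topic from the document's topic mixture, then draw a word from that topic's word mixture.

# hypothetical document: 70% "school" topic, 30% "battle" topic
doc_topic_mix <- c(school = 0.7, battle = 0.3)

# hypothetical topics: each is a probability distribution over words
topic_word_mix <- list(
  school = c(class = 0.5, professor = 0.3, wand = 0.2),
  battle = c(wand = 0.6, curse = 0.3, class = 0.1)
)

# generate one word: first draw a topic, then draw a word from that topic
set.seed(1)
topic <- sample(names(doc_topic_mix), 1, prob = doc_topic_mix)
word  <- sample(names(topic_word_mix[[topic]]), 1, prob = topic_word_mix[[topic]])
word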

Prerequisites

For this tutorial we will be using the following packages:

library(harrypotter) #harry potter books 1-7
library(topicmodels) #topic modeling functions
library(stringr) #common string functions
library(tidytext) #tidy text analysis
library(tidyverse) #data manipulation and visualization
library(scales) #used for percent scale on confusion table

For the purpose of this tutorial, we will be using the collection of Harry Potter novels. The harrypotter package holds the full text of books 1-7 by J.K. Rowling; it is hosted on GitHub rather than CRAN.
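
A minimal one-time setup (assuming the devtools package is installed) looks like:

# install the harrypotter data package from GitHub (only needed once)
devtools::install_github("bradleyboehmke/harrypotter")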

Preprocessing

The harrypotter package makes indexing the text by chapter easy, as each book already comes split into chapters. We build one tibble per book holding the chapter text and title, combine them, and then unite title and chapter into a single document column.

# one tibble per book: one row per chapter, with the chapter number
# taken from the row id
ChamSec <- tibble(text = chamber_of_secrets, title = "Chamber Secrets") %>%
  rowid_to_column(var = "chapter")
DeathHall <- tibble(text = deathly_hallows, title = "Deathly Hallows") %>%
  rowid_to_column(var = "chapter")
GobFire <- tibble(text = goblet_of_fire, title = "Goblet Fire") %>%
  rowid_to_column(var = "chapter")

by_chapter <- rbind(ChamSec, DeathHall, GobFire)
by_chapter <- by_chapter %>%
  unite(document, c("title", "chapter"), sep = "_", remove = TRUE)

We can then use unnest_tokens to split the text into individual words. However, many words used frequently in English sentence construction carry little meaning on their own (e.g. “the”, “as”, “and”, “of”). These words, available as the stop_words dataset in the tidytext package, are systematically removed with anti_join.

# split into words
by_chapter_word <- by_chapter %>%
  unnest_tokens(word, text)

# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

top_n(word_counts, 10)
## # A tibble: 10 x 3
##              document  word     n
##                 <chr> <chr> <int>
##  1 Chamber Secrets_19 harry   173
##  2     Goblet Fire_31 harry   161
##  3     Goblet Fire_26 harry   159
##  4     Goblet Fire_28 harry   152
##  5     Goblet Fire_20 harry   144
##  6     Goblet Fire_23 harry   144
##  7     Goblet Fire_18 harry   131
##  8     Goblet Fire_19 harry   129
##  9 Deathly Hallows_31 harry   123
## 10 Deathly Hallows_36 harry   118

The word_counts tibble stores the most frequent words in each document (a title-chapter combination) and their frequencies. It is not uncommon for the most common terms in a corpus to be names and other proper nouns, which may or may not add value to the meaning we are trying to extract. In this collection, the most frequent word in nearly every chapter is, unsurprisingly, “harry”. A topic dominated by “harry” would add very little meaning, as it would be present overwhelmingly in every document. We can add custom terms to our stop words with:

stopwords <- add_row(stop_words, word = c("harry","hermione", "ron"), lexicon = c("SMART", "SMART", "SMART"))

# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stopwords) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

With the main characters' names removed, the remaining terms could prove more insightful about the differences between books and chapters. We can already see a more diverse set of most frequent terms.

top_n(word_counts, 10)
## # A tibble: 11 x 3
##              document       word     n
##                 <chr>      <chr> <int>
##  1 Deathly Hallows_33      snape   113
##  2     Goblet Fire_36 dumbledore    95
##  3 Deathly Hallows_10   kreacher    87
##  4     Goblet Fire_21      dobby    77
##  5     Goblet Fire_30 dumbledore    76
##  6  Chamber Secrets_2      dobby    65
##  7 Deathly Hallows_35 dumbledore    63
##  8     Goblet Fire_24     hagrid    63
##  9 Deathly Hallows_23   greyback    61
## 10 Deathly Hallows_24       wand    61
## 11     Goblet Fire_27     sirius    61

As of now we can only analyze how frequent each term is in each document, so further processing is required. We must transform our term frequencies into a document-term matrix (dtm): an aptly named matrix in which the rows are documents, the columns are terms, and the values are the frequency (or weighting) of each term in each document. Such a matrix is typically highly sparse, since most terms appear in only a small fraction of the documents; it is on this matrix that we will find natural groupings of the data (topics).

chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)

chapters_dtm
## <<DocumentTermMatrix (documents: 93, terms: 16669)>>
## Non-/sparse entries: 95345/1454872
## Sparsity           : 94%
## Maximal term length: 24
## Weighting          : term frequency (tf)
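
If you want to peek at the matrix itself, the inspect function from the tm package (which cast_dtm builds on) prints a readable slice of it; here rows are documents and columns are terms:

# print a small corner of the matrix: 5 documents by 5 terms
tm::inspect(chapters_dtm[1:5, 1:5])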

LDA

Now that we have a dtm object constructed, we can continue with the driving force of topic modeling, Latent Dirichlet Allocation (LDA). LDA estimates the mixture of words that makes up each topic and the mixture of topics that makes up each document. Based on the words in each document, we can then determine how strongly each document is associated with each topic. We can choose any number of topics, but since we are looking for differences between the books, we start with three. Each topic is associated with a number of words, each carrying a beta value that indicates how much the word contributes to that topic.

chapters_lda <- LDA(chapters_dtm, k = 3, control = list(seed = 1234))

chapter_topics <- tidy(chapters_lda, matrix = "beta")

top_n(chapter_topics, 10)
## # A tibble: 10 x 3
##    topic       term        beta
##    <int>      <chr>       <dbl>
##  1     1 dumbledore 0.007482439
##  2     3 dumbledore 0.008641894
##  3     3     hagrid 0.006486753
##  4     1       wand 0.009288281
##  5     2       wand 0.006326102
##  6     1  voldemort 0.008025518
##  7     3  professor 0.006485808
##  8     2    weasley 0.009745903
##  9     2     looked 0.006322032
## 10     3     looked 0.006693723
top_terms <- chapter_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

The top words that make up each topic are shown with their respective contributions (beta) to that topic. At this point, we can note some similarities between the top words in each topic; for example, “dumbledore” and “wand” appear in the top 5 terms of two topics. Topics can contain the same words, but the extent to which a word contributes to a given topic differs. Since our corpus consists of three books in a series, many significant words may be shared between topics (there is a common theme across all the Harry Potter books). This will come into play when we wish to determine our classification accuracy.
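
To quantify this overlap, one sketch (adapting a common log-ratio comparison; the .001 cutoff is an arbitrary choice) is to spread beta into one column per topic and compare, say, topics 1 and 2. Words with a log ratio near zero contribute about equally to both topics:

# spread beta into one column per topic and compare topics 1 and 2;
# the filter keeps only words that matter to at least one of the two
beta_wide <- chapter_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(desc(abs(log_ratio)))

beta_wide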

Gamma

Until now, we’ve been looking at the construction of topics from words, given their beta values (how heavily each word is weighted in each topic). Next, we would like to know how much each document is associated with each topic. The metric for this is “gamma”, the per-document per-topic probability. In essence, gamma is the proportion of a document that is made up of words from a given topic. For example, if the document Deathly Hallows_33 (chapter 33 of “Deathly Hallows”) had a gamma of 0.10 for topic 1, roughly 10% of its words would be drawn from that topic. We again use the tidy function to portray information from our statistical model in a tidy, one-row-per-document-per-topic format for ease of interpretation.

chapters_gamma <- tidy(chapters_lda, matrix = "gamma")

chapters_gamma <- chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)

top_n(chapters_gamma, 10)
## # A tibble: 10 x 4
##              title chapter topic     gamma
##              <chr>   <int> <int>     <dbl>
##  1 Deathly Hallows      36     1 0.9999615
##  2 Deathly Hallows      26     2 0.9999586
##  3     Goblet Fire       9     2 0.9999585
##  4 Deathly Hallows       8     2 0.9999563
##  5     Goblet Fire       8     2 0.9999568
##  6 Deathly Hallows      15     2 0.9999554
##  7     Goblet Fire      31     3 0.9999622
##  8     Goblet Fire      28     3 0.9999616
##  9     Goblet Fire      26     3 0.9999645
## 10     Goblet Fire      23     3 0.9999642

We can visually inspect how well our unsupervised model was able to distinguish between the topics for each of the titles. Using a boxplot, we can make a few observations. Ideally, the boxes and whiskers would be well separated for each of the titles, which might be the case if the titles were not from the same franchise. However, because these titles inherently share many themes, we notice some overlap between “Deathly Hallows” and “Goblet of Fire”. In contrast, “Chamber of Secrets” remains relatively well separated, being most associated with topic 3. With highly related titles, many documents are mixtures of topics to varying degrees and rarely associate with only one topic.

chapters_gamma %>%
  mutate(title = reorder(title, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title)

Classification

The goal of unsupervised classification is to assign each document the topic it most likely belongs to: generally, the topic that accounts for the highest proportion of the words in the document, or in this case chapter. We can then classify each title as belonging to a specific topic, where the consensus topic for a title is the one most present across its chapters. For example, it appears that “Deathly Hallows” consists mostly of chapters whose words are most likely associated with topic 1. From this we can determine how many chapters have been misidentified: in this case, 47 of the 93 chapters. This is better than a shot-in-the-dark approach, but we can use a confusion table to explore why the model performed so poorly.

chapter_classifications <- chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()

book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus) 
## # A tibble: 47 x 5
##              title chapter topic     gamma       consensus
##              <chr>   <int> <int>     <dbl>           <chr>
##  1 Chamber Secrets      19     1 0.5666512 Deathly Hallows
##  2     Goblet Fire      33     1 0.9354642 Deathly Hallows
##  3 Chamber Secrets       4     1 0.3846612 Deathly Hallows
##  4     Goblet Fire      34     1 0.9999036 Deathly Hallows
##  5     Goblet Fire       4     1 0.8677189 Deathly Hallows
##  6     Goblet Fire       3     1 0.8858103 Deathly Hallows
##  7 Chamber Secrets       5     1 0.7037823 Deathly Hallows
##  8 Chamber Secrets       1     1 0.9998946 Deathly Hallows
##  9 Chamber Secrets      18     1 0.6589321 Deathly Hallows
## 10     Goblet Fire      32     1 0.8925594 Deathly Hallows
## # ... with 37 more rows
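
To put a number on this (a quick sketch using the objects defined above), the chapter-level accuracy is simply the share of chapters whose true title matches the consensus; with 47 of 93 chapters misidentified, it comes out just under 50%:

# share of chapters whose assigned consensus matches their actual title
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  summarize(accuracy = mean(title == consensus))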

We can then use the augment function to add, for each term in each document, its count and the topic it has been assigned to. This way, we know how much each word weighs in on the topic assignment of the chapter, and ultimately the title, as a whole.

assignments <- augment(chapters_lda, data = chapters_dtm)
top_n(assignments, 10)
## # A tibble: 39,115 x 4
##          document  term count .topic
##             <chr> <chr> <dbl>  <dbl>
##  1 Goblet Fire_36 snape    15      3
##  2 Goblet Fire_30 snape     7      3
##  3 Goblet Fire_27 snape    52      3
##  4 Goblet Fire_31 snape     2      3
##  5 Goblet Fire_25 snape    41      3
##  6 Goblet Fire_35 snape    11      3
##  7 Goblet Fire_28 snape    11      3
##  8 Goblet Fire_14 snape     5      3
##  9 Goblet Fire_26 snape    14      3
## 10 Goblet Fire_13 snape     2      3
## # ... with 39,105 more rows
assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

assignments
## # A tibble: 108,869 x 6
##              title chapter  term count .topic       consensus
##              <chr>   <int> <chr> <dbl>  <dbl>           <chr>
##  1 Deathly Hallows      33 snape   113      1 Deathly Hallows
##  2     Goblet Fire      36 snape    15      3 Chamber Secrets
##  3     Goblet Fire      36 snape    15      3     Goblet Fire
##  4 Deathly Hallows      10 snape     2      1 Deathly Hallows
##  5     Goblet Fire      30 snape     7      3 Chamber Secrets
##  6     Goblet Fire      30 snape     7      3     Goblet Fire
##  7 Deathly Hallows      35 snape     1      1 Deathly Hallows
##  8     Goblet Fire      27 snape    52      3 Chamber Secrets
##  9     Goblet Fire      27 snape    52      3     Goblet Fire
## 10     Goblet Fire      31 snape     2      3 Chamber Secrets
## # ... with 108,859 more rows

Visualizing a confusion table, we can investigate what proportion of the words from each title were assigned to each consensus title. This confirms what we predicted with the boxplot: the topics were closely related across multiple titles (reflected in the overlapping boxplots earlier), and many of the “Goblet of Fire” and “Chamber of Secrets” assignments were confused with each other. Only “Deathly Hallows” was distinguished accurately enough from the other titles to suffer few misassignments.

assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")

Which terms were the culprits? We can arrange the words that were most frequent among the misassigned topics to get to the bottom of it. As we might suspect, terms that recur throughout the franchise, like “dumbledore”, “hagrid”, and “looked”, made it difficult for LDA (a stochastic algorithm) to distinguish consensus topics for each of the documents.

wrong_words <- assignments %>%
  filter(title != consensus)

wrong_words %>%
  count(title, consensus, term, wt = count) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  top_n(10)
## # A tibble: 10 x 4
##          title       consensus       term     n
##          <chr>           <chr>      <chr> <dbl>
##  1 Goblet Fire Chamber Secrets dumbledore   513
##  2 Goblet Fire Chamber Secrets     looked   358
##  3 Goblet Fire Chamber Secrets     hagrid   324
##  4 Goblet Fire Chamber Secrets      moody   308
##  5 Goblet Fire Chamber Secrets  professor   285
##  6 Goblet Fire Chamber Secrets     crouch   224
##  7 Goblet Fire Chamber Secrets       eyes   218
##  8 Goblet Fire Chamber Secrets      snape   209
##  9 Goblet Fire Chamber Secrets       time   204
## 10 Goblet Fire Chamber Secrets     cedric   200

Exercises

For further practice, try using the same method on other titles. Project Gutenberg provides many classic titles to choose from, and the gutenbergr package allows these titles to be loaded into R with minimal preprocessing. For example, with the selection of titles below:

library(gutenbergr)

titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")

books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")

# divide into documents, each representing one chapter
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

Using other titles, answer the following questions:

  1. What happens as you assign more topics? Fewer topics? (Try both more and fewer topics than the number of titles being analyzed; the perplexity sketch after this list is one way to compare the fits.)

  2. When you choose titles that are too closely related (as we experienced in this tutorial), how accurate is your model? What if your titles vary greatly?
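
For the first exercise, one hedged way to compare models with different numbers of topics is perplexity (lower is better). The sketch below assumes a document-term matrix built as above and the default VEM fitting used by LDA; fitting several models can take a while:

# fit a model for several candidate values of k and compare perplexity
ks <- c(2, 3, 4, 6, 8)
perplexities <- sapply(ks, function(k) {
  model <- LDA(chapters_dtm, k = k, control = list(seed = 1234))
  perplexity(model)  # perplexity on the training data (VEM models)
})

data.frame(k = ks, perplexity = perplexities)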
