In analyzing text data, we attempt to derive meaning from a collection of documents. Often this means statistically separating the documents into groups based on their characteristics. One popular method of achieving this is topic modeling: an unsupervised method for extracting topics from a collection of documents, where the topics, or groups, are unobserved. The statistical model driving the assignment of documents to topics is Latent Dirichlet Allocation, or LDA, a generative probabilistic model widely used in natural language processing.
The driving principles behind the LDA algorithm are as follows:
1.) Every document can be represented as a mixture of topics
2.) Every topic can be represented as a mixture of words
Leveraging these principles, we can estimate how likely each document is to belong to each topic, given the mixture of words it contains.
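To make these principles concrete, here is a toy sketch with invented probabilities (the numbers are made up for illustration, not drawn from any fitted model): the probability of observing a word in a document combines the document's topic mixture with each topic's word distribution.
# toy illustration: P(word | document) is the document's topic mixture
# weighted by each topic's word probabilities
doc_topics <- c(topic1 = 0.7, topic2 = 0.3)           # P(topic | document)
word_given_topic <- c(topic1 = 0.020, topic2 = 0.001) # P("wand" | topic)
sum(doc_topics * word_given_topic)                    # P("wand" | document) = 0.0143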
For this tutorial we will be using the following packages:
library(harrypotter) #harry potter books 1-7
library(topicmodels) #topic modeling functions
library(stringr) #common string functions
library(tidytext) #tidy text analysis
library(tidyverse) #data manipulation and visualization
library(scales) #used for percent scale on confusion table
For the purpose of this tutorial, we will be using the collection of Harry Potter novels. The harrypotter package contains the full text of Harry Potter books 1-7 by J.K. Rowling and can be installed with devtools::install_github("bradleyboehmke/harrypotter").
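If the package is not yet installed, the following should work (devtools is assumed to be available; install it first with install.packages("devtools") if needed):
# install the harrypotter package from GitHub (one-time setup)
devtools::install_github("bradleyboehmke/harrypotter")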
The harrypotter package makes indexing the text by chapter easy, as each book is already split into its chapters. We create a tibble with a document column, formed by combining the title and chapter number, and a text column which contains the corpus.
# one row per chapter, with the chapter number added as an id column
ChamSec <- rowid_to_column(tibble(text = chamber_of_secrets, title = "Chamber Secrets"), var = "chapter")
DeathHall <- rowid_to_column(tibble(text = deathly_hallows, title = "Deathly Hallows"), var = "chapter")
GobFire <- rowid_to_column(tibble(text = goblet_of_fire, title = "Goblet Fire"), var = "chapter")
# stack the three books and build document ids of the form "title_chapter"
by_chapter <- rbind(ChamSec, DeathHall, GobFire)
by_chapter <- by_chapter %>%
  unite(document, c("title", "chapter"), sep = "_", remove = TRUE)
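As a quick sanity check (output omitted), the result should have one row per chapter, with document ids like "Chamber Secrets_1":
head(by_chapter$document)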
We can then use unnest_tokens to split the corpus into individual words. However, many words used frequently in English sentence construction carry no meaning of their own (e.g. "the", "as", "and", "of"). These words, available as the stop_words dataset in the tidytext package, are systematically removed using anti_join.
# split into words
by_chapter_word <- by_chapter %>%
  unnest_tokens(word, text)
# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
top_n(word_counts, 10)
## # A tibble: 10 x 3
## document word n
## <chr> <chr> <int>
## 1 Chamber Secrets_19 harry 173
## 2 Goblet Fire_31 harry 161
## 3 Goblet Fire_26 harry 159
## 4 Goblet Fire_28 harry 152
## 5 Goblet Fire_20 harry 144
## 6 Goblet Fire_23 harry 144
## 7 Goblet Fire_18 harry 131
## 8 Goblet Fire_19 harry 129
## 9 Deathly Hallows_31 harry 123
## 10 Deathly Hallows_36 harry 118
The word counts for each chapter of each title are stored in the word_counts tibble. It is not uncommon for the most frequent terms in a corpus to be names and other proper nouns, which may or may not add value to the meaning we are trying to extract. In this collection, the most frequent word in each chapter is, unsurprisingly, "harry". It adds very little meaning, as an extracted topic built around "harry" would be overwhelmingly present in every document. We can add custom terms to our stop words with:
stopwords <- add_row(stop_words,
                     word = c("harry", "hermione", "ron"),
                     lexicon = c("SMART", "SMART", "SMART"))
# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stopwords) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
Removing the main characters' names should make the differences in topics between the books and chapters easier to surface. We can already see a more diverse set of most frequent terms.
top_n(word_counts, 10)
## # A tibble: 11 x 3
## document word n
## <chr> <chr> <int>
## 1 Deathly Hallows_33 snape 113
## 2 Goblet Fire_36 dumbledore 95
## 3 Deathly Hallows_10 kreacher 87
## 4 Goblet Fire_21 dobby 77
## 5 Goblet Fire_30 dumbledore 76
## 6 Chamber Secrets_2 dobby 65
## 7 Deathly Hallows_35 dumbledore 63
## 8 Goblet Fire_24 hagrid 63
## 9 Deathly Hallows_23 greyback 61
## 10 Deathly Hallows_24 wand 61
## 11 Goblet Fire_27 sirius 61
As of now we can only analyze how frequent each term is in each document, so further processing is required. We must transform our term frequencies into a document-term matrix (dtm): an aptly-named matrix in which the rows are documents and the columns are terms. The values of this matrix are the respective frequencies (or weightings) of the terms. Document-term matrices are typically highly sparse, since most terms appear in only a small subset of documents; it is this pattern of term co-occurrence that lets us find natural groupings of the data (topics).
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)
chapters_dtm
## <<DocumentTermMatrix (documents: 93, terms: 16669)>>
## Non-/sparse entries: 95345/1454872
## Sparsity : 94%
## Maximal term length: 24
## Weighting : term frequency (tf)
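To peek inside the matrix itself, the inspect function from the tm package (installed as a dependency of topicmodels, though you may need to attach it explicitly) can display a small corner of the matrix; a sketch for exploration, output omitted:
library(tm)
# show the counts for the first five documents and first five terms
inspect(chapters_dtm[1:5, 1:5])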
Now that we have a dtm object constructed, we can continue by implementing the driving force of topic modeling, Latent Dirichlet Allocation (LDA). LDA finds the mixture of words that makes up each topic and the mixture of topics that makes up each document. Based on the words in each document, we can then determine how likely each document is to be associated with each topic. We can choose any number of topics, but since we are looking for differences in topics between the books, we start with three. We can see that each topic is associated with a number of words, each with a respective beta indicating how much it contributes to that topic.
chapters_lda <- LDA(chapters_dtm, k = 3, control = list(seed = 1234))
chapter_topics <- tidy(chapters_lda, matrix = "beta")
top_n(chapter_topics, 10)
## # A tibble: 10 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 dumbledore 0.007482439
## 2 3 dumbledore 0.008641894
## 3 3 hagrid 0.006486753
## 4 1 wand 0.009288281
## 5 2 wand 0.006326102
## 6 1 voldemort 0.008025518
## 7 3 professor 0.006485808
## 8 2 weasley 0.009745903
## 9 2 looked 0.006322032
## 10 3 looked 0.006693723
top_terms <- chapter_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
The top words that make up each topic are shown with their respective contribution (beta) to that topic. At this point, we can note some similarities between the top words in each topic. For example, "dumbledore" and "wand" appear in the top 5 terms of two topics. Topics can contain the same words; what differs is the extent to which each word contributes to a given topic. Since our corpus comes from three books in the same series, many significant words may be shared between topics (there is a common theme across all Harry Potter books). This will come into play when we determine our classification accuracy.
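One way to quantify this overlap is to spread the betas into one column per topic and compare two of them directly. A sketch comparing topics 1 and 2 (the names beta_spread and log_ratio are our own; the log ratio is simply a convenient measure of how much more a word belongs to one topic than the other):
beta_spread <- chapter_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))
# words most strongly tied to one of the two topics
beta_spread %>%
  arrange(desc(abs(log_ratio))) %>%
  head()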
Until now, we’ve been looking at the construction of topics from words, given their beta (how heavily each word weighs in each topic). Next, we would like to know how much each document is associated with each topic. The metric for this is "gamma", the per-document-per-topic probability. In essence, gamma is the proportion of the document that is made up of words from the assigned topic. For example, the document Deathly Hallows_36, chapter 36 of the title "Deathly Hallows", is assigned to topic 1 with a gamma of 0.99996: nearly 100% of its words are drawn from that topic. We again use the tidy function to portray information from our statistical model in a tidy, one-row-per-document-per-topic format for ease of interpretation.
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma <- chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
top_n(chapters_gamma, 10)
## # A tibble: 10 x 4
## title chapter topic gamma
## <chr> <int> <int> <dbl>
## 1 Deathly Hallows 36 1 0.9999615
## 2 Deathly Hallows 26 2 0.9999586
## 3 Goblet Fire 9 2 0.9999585
## 4 Deathly Hallows 8 2 0.9999563
## 5 Goblet Fire 8 2 0.9999568
## 6 Deathly Hallows 15 2 0.9999554
## 7 Goblet Fire 31 3 0.9999622
## 8 Goblet Fire 28 3 0.9999616
## 9 Goblet Fire 26 3 0.9999645
## 10 Goblet Fire 23 3 0.9999642
We can visually inspect how well our unsupervised learning was able to distinguish between the topics for each of the titles. Looking at the boxplots below, we can make a few observations. Ideally, we would like the boxes and whiskers to be well separated for each of the titles, which might be the case if we were looking at titles that are not from the same franchise. However, because these titles inherently share many of the same themes, we notice some overlap between "Deathly Hallows" and "Goblet of Fire". In contrast, "Chamber of Secrets" remains relatively distinguished, being most associated with topic 3. With highly related titles, many documents are mixtures of topics to varying degrees, and rarely associate with only one topic.
chapters_gamma %>%
  mutate(title = reorder(title, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title)
The goal of unsupervised classification is to assign each document the topic it most likely belongs to: generally, the topic contributing the highest proportion of words in the document, or in this case chapter. We can then classify each of the titles as belonging to a specific topic, where the consensus topic for a title is the topic most present across its chapters. For example, the title "Deathly Hallows" mostly contains chapters made up of words associated with topic 1. From this we can determine how many of the chapters have been misidentified: in this case, 47 of the 93 chapters. This is better than a shot-in-the-dark approach; still, we can explore further using a confusion table to determine why this model performed so poorly.
chapter_classifications <- chapters_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup()
book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  group_by(title) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)
chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)
## # A tibble: 47 x 5
## title chapter topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 Chamber Secrets 19 1 0.5666512 Deathly Hallows
## 2 Goblet Fire 33 1 0.9354642 Deathly Hallows
## 3 Chamber Secrets 4 1 0.3846612 Deathly Hallows
## 4 Goblet Fire 34 1 0.9999036 Deathly Hallows
## 5 Goblet Fire 4 1 0.8677189 Deathly Hallows
## 6 Goblet Fire 3 1 0.8858103 Deathly Hallows
## 7 Chamber Secrets 5 1 0.7037823 Deathly Hallows
## 8 Chamber Secrets 1 1 0.9998946 Deathly Hallows
## 9 Chamber Secrets 18 1 0.6589321 Deathly Hallows
## 10 Goblet Fire 32 1 0.8925594 Deathly Hallows
## # ... with 37 more rows
We can then use the augment function to add each term's count next to the topic the term has been assigned to. This way, we know how much each word weighs in on the topic assignment of the chapter, and ultimately the title, as a whole.
assignments <- augment(chapters_lda, data = chapters_dtm)
top_n(assignments, 10)
## # A tibble: 39,115 x 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 Goblet Fire_36 snape 15 3
## 2 Goblet Fire_30 snape 7 3
## 3 Goblet Fire_27 snape 52 3
## 4 Goblet Fire_31 snape 2 3
## 5 Goblet Fire_25 snape 41 3
## 6 Goblet Fire_35 snape 11 3
## 7 Goblet Fire_28 snape 11 3
## 8 Goblet Fire_14 snape 5 3
## 9 Goblet Fire_26 snape 14 3
## 10 Goblet Fire_13 snape 2 3
## # ... with 39,105 more rows
assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))
assignments
## # A tibble: 108,869 x 6
## title chapter term count .topic consensus
## <chr> <int> <chr> <dbl> <dbl> <chr>
## 1 Deathly Hallows 33 snape 113 1 Deathly Hallows
## 2 Goblet Fire 36 snape 15 3 Chamber Secrets
## 3 Goblet Fire 36 snape 15 3 Goblet Fire
## 4 Deathly Hallows 10 snape 2 1 Deathly Hallows
## 5 Goblet Fire 30 snape 7 3 Chamber Secrets
## 6 Goblet Fire 30 snape 7 3 Goblet Fire
## 7 Deathly Hallows 35 snape 1 1 Deathly Hallows
## 8 Goblet Fire 27 snape 52 3 Chamber Secrets
## 9 Goblet Fire 27 snape 52 3 Goblet Fire
## 10 Goblet Fire 31 snape 2 3 Chamber Secrets
## # ... with 108,859 more rows
Visualizing a confusion table, we can investigate what proportion of the words from each title were assigned to each consensus title, and confirm what we predicted with the boxplots: several topics were closely related to multiple titles (reflected by the overlapping boxplots earlier). Many words from "Goblet of Fire" and "Chamber of Secrets" were misassigned to each other. Only "Deathly Hallows" was distinguished by topic from the other titles cleanly enough to have few misassignments.
assignments %>%
  count(title, consensus, wt = count) %>%
  group_by(title) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")
Which terms were the culprits? We can arrange the words that were most frequent among the misassignments to get to the bottom of it. As we would suspect, terms that recur throughout the franchise, like "dumbledore", "hagrid", and "looked", made it difficult for LDA to settle on distinct consensus topics for each of the documents.
wrong_words <- assignments %>%
  filter(title != consensus)
wrong_words %>%
  count(title, consensus, term, wt = count) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  top_n(10)
## # A tibble: 10 x 4
## title consensus term n
## <chr> <chr> <chr> <dbl>
## 1 Goblet Fire Chamber Secrets dumbledore 513
## 2 Goblet Fire Chamber Secrets looked 358
## 3 Goblet Fire Chamber Secrets hagrid 324
## 4 Goblet Fire Chamber Secrets moody 308
## 5 Goblet Fire Chamber Secrets professor 285
## 6 Goblet Fire Chamber Secrets crouch 224
## 7 Goblet Fire Chamber Secrets eyes 218
## 8 Goblet Fire Chamber Secrets snape 209
## 9 Goblet Fire Chamber Secrets time 204
## 10 Goblet Fire Chamber Secrets cedric 200
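To dig into a single offender, we can check where a term actually occurs. For instance, "moody" refers to a Goblet of Fire character whose words were largely assigned to the "Chamber Secrets" consensus (output omitted):
word_counts %>%
  filter(word == "moody")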
For further practice, try using the same method on other titles. Project Gutenberg provides many classic titles to choose from, and the gutenbergr package allows these titles to be loaded into R with minimal preprocessing. For example, for the selection of titles below:
library(gutenbergr)
titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")
# divide into documents, each representing one chapter
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
Using other titles, answer the following questions:
What happens as you assign more topics? Fewer topics? (Try both more and fewer topics than the number of titles being analyzed.)
When you choose two titles that are too closely related (as we experienced in this tutorial), how accurate is your model? What if your titles vary greatly?
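As a starting point for the first question, the perplexity function in topicmodels offers a rough measure of model fit (lower is better). A sketch, reusing chapters_dtm from above (swap in the dtm built from whichever titles you choose):
# fit models for several candidate topic counts and compare perplexity
ks <- c(2, 3, 4, 6)
sapply(ks, function(k) {
  lda_k <- LDA(chapters_dtm, k = k, control = list(seed = 1234))
  perplexity(lda_k)
})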