Text analysis: topic modeling

Objectives

Demonstrate how to use LDA to recover topic structure from a known set of topics

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.3.3

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'ggplot2' was built under R version 3.3.3

## Warning: package 'tibble' was built under R version 3.3.3

## Warning: package 'tidyr' was built under R version 3.3.3

## Warning: package 'readr' was built under R version 3.3.3

## Warning: package 'purrr' was built under R version 3.3.3

## Warning: package 'dplyr' was built under R version 3.3.3

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(gutenbergr)

## Warning: package 'gutenbergr' was built under R version 3.3.3

library(tidytext)

## Warning: package 'tidytext' was built under R version 3.3.3

library(topicmodels)

## Warning: package 'topicmodels' was built under R version 3.3.3

library(stringr)

## Warning: package 'stringr' was built under R version 3.3.3

set.seed(1234)
theme_set(theme_minimal())


titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")


books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title", mirror = "ftp://aleph.gutenberg.org/")

Pre-processing

As pre-processing, we divide these into chapters, use unnest_tokens() from tidytext to separate them into words, then remove stop words. We are treating every chapter as a separate “document”, each with a name like Great Expectations_1 or Pride and Prejudice_11.

by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0)

by_chapter_word <- by_chapter %>%
  unite(title_chapter, title, chapter) %>%
  unnest_tokens(word, text)

word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(title_chapter, word, sort = TRUE) %>%
  ungroup()

## Joining, by = "word"

word_counts

## # A tibble: 104,721 x 3
##               title_chapter    word     n
##                       <chr>   <chr> <int>
##  1    Great Expectations_57     joe    88
##  2     Great Expectations_7     joe    70
##  3    Great Expectations_17   biddy    63
##  4    Great Expectations_27     joe    58
##  5    Great Expectations_38 estella    58
##  6     Great Expectations_2     joe    56
##  7    Great Expectations_23  pocket    53
##  8    Great Expectations_15     joe    50
##  9    Great Expectations_18     joe    50
## 10 The War of the Worlds_16 brother    50
## # ... with 104,711 more rows

Latent Dirichlet allocation with the topicmodels package

Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a DocumentTermMatrix (from the tm package). We can cast a one-token-per-row table into a DocumentTermMatrix with cast_dtm():

chapters_dtm <- word_counts %>%
  cast_dtm(title_chapter, word, n)

chapters_dtm

## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104721/3410774
## Sparsity           : 97%
## Maximal term length: 19
## Weighting          : term frequency (tf)

chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda

## A LDA_VEM topic model with 4 topics.

Now tidytext gives us the option of returning to a tidy analysis, using the tidy() and augment() verbs borrowed from the broom package. In particular, we start with the tidy() verb.

chapters_lda_td <- tidy(chapters_lda)
chapters_lda_td

## # A tibble: 72,860 x 3
##    topic    term         beta
##    <int>   <chr>        <dbl>
##  1     1     joe 5.830326e-17
##  2     2     joe 3.194447e-57
##  3     3     joe 4.162676e-24
##  4     4     joe 1.445030e-02
##  5     1   biddy 7.846976e-27
##  6     2   biddy 4.672244e-69
##  7     3   biddy 2.259711e-46
##  8     4   biddy 4.767972e-03
##  9     1 estella 3.827272e-06
## 10     2 estella 5.316964e-65
## # ... with 72,850 more rows

Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination the model has beta (β), the probability of that term being generated from that topic.

We could use top_n() from dplyr to find the top 5 terms within each topic:

top_terms <- chapters_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms

## # A tibble: 20 x 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
##  1     1 elizabeth 0.014107538
##  2     1     darcy 0.008814258
##  3     1      miss 0.008706741
##  4     1    bennet 0.006947431
##  5     1      jane 0.006497512
##  6     2   captain 0.015507696
##  7     2  nautilus 0.013050048
##  8     2       sea 0.008850073
##  9     2      nemo 0.008708397
## 10     2       ned 0.008030799
## 11     3    people 0.006797400
## 12     3  martians 0.006512569
## 13     3      time 0.005347115
## 14     3     black 0.005278302
## 15     3     night 0.004483143
## 16     4       joe 0.014450300
## 17     4      time 0.006847574
## 18     4       pip 0.006817363
## 19     4    looked 0.006365257
## 20     4      miss 0.006228387

Vusulization

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

These topics are pretty clearly associated with the four books: “nemo”, “sea”, and “nautilus” belongs to Twenty Thousand Leagues Under the Sea “jane”, “darcy”, and “elizabeth” belongs to Pride and Prejudice “pip” and “joe” from Great Expectations “martians”, “black”, and “night” from The War of the Worlds

Also note that LDA() does not assign any label to each topic. They are simply topics 1, 2, 3, and 4. We can infer these are associated with each book, but it is merely our inference.

Per-document classification

Each chapter was a “document” in this analysis. Thus, we may want to know which topics are associated with each document. Can we put the chapters back together in the correct books?

chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_lda_gamma

## # A tibble: 772 x 3
##                    document topic        gamma
##                       <chr> <int>        <dbl>
##  1    Great Expectations_57     1 1.351886e-05
##  2     Great Expectations_7     1 1.470726e-05
##  3    Great Expectations_17     1 2.117127e-05
##  4    Great Expectations_27     1 1.919746e-05
##  5    Great Expectations_38     1 3.544403e-01
##  6     Great Expectations_2     1 1.723723e-05
##  7    Great Expectations_23     1 5.507241e-01
##  8    Great Expectations_15     1 1.682503e-02
##  9    Great Expectations_18     1 1.272044e-05
## 10 The War of the Worlds_16     1 1.084337e-05
## # ... with 762 more rows

Setting matrix = “gamma” returns a tidied version with one-document-per-topic-per-row. Now that we have these document classifiations, we can see how well our unsupervised learning did at distinguishing the four books. First we re-separate the document name into title and chapter:

chapters_lda_gamma <- chapters_lda_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
chapters_lda_gamma

## # A tibble: 772 x 4
##                    title chapter topic        gamma
##  *                 <chr>   <int> <int>        <dbl>
##  1    Great Expectations      57     1 1.351886e-05
##  2    Great Expectations       7     1 1.470726e-05
##  3    Great Expectations      17     1 2.117127e-05
##  4    Great Expectations      27     1 1.919746e-05
##  5    Great Expectations      38     1 3.544403e-01
##  6    Great Expectations       2     1 1.723723e-05
##  7    Great Expectations      23     1 5.507241e-01
##  8    Great Expectations      15     1 1.682503e-02
##  9    Great Expectations      18     1 1.272044e-05
## 10 The War of the Worlds      16     1 1.084337e-05
## # ... with 762 more rows

Then we examine what fraction of chapters we got right for each:

ggplot(chapters_lda_gamma, aes(gamma, fill = factor(topic))) +
  geom_histogram() +
  facet_wrap(~ title, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Topics Identified

We notice that almost all of the chapters from Pride and Prejudice, War of the Worlds, and Twenty Thousand Leagues Under the Sea were uniquely identified as a single topic each.

chapter_classifications <- chapters_lda_gamma %>%
  group_by(title, chapter) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(gamma)

chapter_classifications

## # A tibble: 193 x 4
##                 title chapter topic     gamma
##                 <chr>   <int> <int>     <dbl>
##  1 Great Expectations      54     3 0.4803234
##  2 Great Expectations      22     4 0.5356506
##  3 Great Expectations      31     4 0.5464851
##  4 Great Expectations      23     1 0.5507241
##  5 Great Expectations      33     4 0.5700737
##  6 Great Expectations      47     4 0.5802089
##  7 Great Expectations      56     4 0.5984806
##  8 Great Expectations      38     4 0.6455341
##  9 Great Expectations      11     4 0.6689600
## 10 Great Expectations      44     4 0.6777974
## # ... with 183 more rows

Aggragate Consensus

We can determine this by finding the consensus book for each, which we note is correct based on our earlier visualization:

book_topics <- chapter_classifications %>%
  count(title, topic) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = title, topic)

book_topics

## # A tibble: 4 x 2
##                               consensus topic
##                                   <chr> <int>
## 1                    Great Expectations     4
## 2                   Pride and Prejudice     1
## 3                 The War of the Worlds     3
## 4 Twenty Thousand Leagues under the Sea     2

Then we see which chapters were misidentified:

chapter_classifications %>%
  inner_join(book_topics, by = "topic") %>%
  count(title, consensus) %>%
  knitr::kable()

title	consensus	n
Great Expectations	Great Expectations	57
Great Expectations	Pride and Prejudice	1
Great Expectations	The War of the Worlds	1
Pride and Prejudice	Pride and Prejudice	61
The War of the Worlds	The War of the Worlds	27
Twenty Thousand Leagues under the Sea	Twenty Thousand Leagues under the Sea	46

By word assignments: augment

One important step in the topic modeling expectation-maximization algorithm is assigning each word in each document to a topic. The more words in a document are assigned to that topic, generally, the more weight (gamma) will go on that document-topic classification.

We may want to take the original document-word pairs and find which words in each document were assigned to which topic. This is the job of the augment() verb.

We can, for example, create a “confusion matrix” using dplyr::count() and tidyr::spread:

assignments <- augment(chapters_lda, data = chapters_dtm)

assignments <- assignments %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  inner_join(book_topics, by = c(".topic" = "topic"))

assignments

## # A tibble: 104,721 x 6
##                 title chapter  term count .topic          consensus
##                 <chr>   <int> <chr> <dbl>  <dbl>              <chr>
##  1 Great Expectations      57   joe    88      4 Great Expectations
##  2 Great Expectations       7   joe    70      4 Great Expectations
##  3 Great Expectations      17   joe     5      4 Great Expectations
##  4 Great Expectations      27   joe    58      4 Great Expectations
##  5 Great Expectations       2   joe    56      4 Great Expectations
##  6 Great Expectations      23   joe     1      4 Great Expectations
##  7 Great Expectations      15   joe    50      4 Great Expectations
##  8 Great Expectations      18   joe    50      4 Great Expectations
##  9 Great Expectations       9   joe    44      4 Great Expectations
## 10 Great Expectations      13   joe    40      4 Great Expectations
## # ... with 104,711 more rows

assignments %>%
  count(title, consensus, wt = count) %>%
  spread(consensus, n, fill = 0) %>%
  knitr::kable()

title	Great Expectations	Pride and Prejudice	The War of the Worlds	Twenty Thousand Leagues under the Sea
Great Expectations	49770	3876	1845	77
Pride and Prejudice	1	37229	7	5
The War of the Worlds	0	0	22561	7
Twenty Thousand Leagues under the Sea	0	5	0	39629

We notice that almost all the words for Pride and Prejudice, Twenty Thousand Leagues Under the Sea, and War of the Worlds were correctly assigned, while Great Expectations had a fair amount of misassignment.

What were the most commonly mistaken words?

wrong_words <- assignments %>%
  filter(title != consensus)

wrong_words

## # A tibble: 4,535 x 6
##                                    title chapter     term count .topic
##                                    <chr>   <int>    <chr> <dbl>  <dbl>
##  1                    Great Expectations      38  brother     2      1
##  2                    Great Expectations      22  brother     4      1
##  3                    Great Expectations      23     miss     2      1
##  4                    Great Expectations      22     miss    23      1
##  5 Twenty Thousand Leagues under the Sea       8     miss     1      1
##  6                    Great Expectations      31     miss     1      1
##  7                    Great Expectations       5 sergeant    37      1
##  8                    Great Expectations      46  captain     1      2
##  9                    Great Expectations      32  captain     1      2
## 10                 The War of the Worlds      17  captain     5      2
## # ... with 4,525 more rows, and 1 more variables: consensus <chr>

wrong_words %>%
  count(title, consensus, term, wt = count) %>%
  ungroup() %>%
  arrange(desc(n))

## # A tibble: 3,500 x 4
##                 title             consensus     term     n
##                 <chr>                 <chr>    <chr> <dbl>
##  1 Great Expectations   Pride and Prejudice     love    44
##  2 Great Expectations   Pride and Prejudice sergeant    37
##  3 Great Expectations   Pride and Prejudice     lady    32
##  4 Great Expectations   Pride and Prejudice     miss    26
##  5 Great Expectations The War of the Worlds     boat    25
##  6 Great Expectations   Pride and Prejudice   father    19
##  7 Great Expectations The War of the Worlds    water    19
##  8 Great Expectations   Pride and Prejudice     baby    18
##  9 Great Expectations   Pride and Prejudice  flopson    18
## 10 Great Expectations   Pride and Prejudice   family    16
## # ... with 3,490 more rows

Notice the word “flopson” here; these wrong words do not necessarily appear in the novels they were misassigned to. Indeed, we can confirm “flopson” appears only in Great Expectations:

word_counts %>%
  filter(word == "flopson")

## # A tibble: 3 x 3
##           title_chapter    word     n
##                   <chr>   <chr> <int>
## 1 Great Expectations_22 flopson    10
## 2 Great Expectations_23 flopson     7
## 3 Great Expectations_33 flopson     1

The algorithm is stochastic and iterative, and it can accidentally land on a topic that spans multiple books.