Frequently when using LDA, you don’t actually know the underlying topic structure of the documents; that is generally why you are using LDA to analyze the text in the first place. LDA is still useful in these instances, but we have to perform additional tests and analyses to confirm that the topic structure it uncovers is a good one.
The topicmodels package includes a document-term matrix of a sample of articles published by the Associated Press in 1992. Let’s load them into R and convert them to a tidy format.
library(tidyverse)
library(tidytext)
library(topicmodels)
library(stringr)
data("AssociatedPress", package = "topicmodels")
ap_td <- tidy(AssociatedPress)
ap_td
## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # ... with 302,021 more rows
AssociatedPress is already stored as a document-term matrix, exactly the format we need for topic modeling. Why tidy it first? Because the original DTM contains stop words, and we want to remove them before modeling the data. Let’s remove the stop words, then cast the data back into a document-term matrix.
ap_dtm <- ap_td %>%
  anti_join(stop_words, by = c(term = "word")) %>%  # drop rows whose term is a stop word
  cast_dtm(document, term, count)                   # back to a DTM for topicmodels
ap_dtm
## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
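As a quick sanity check (not part of the original analysis), we can compare the vocabulary before and after removing stop words; dim() works directly on a document-term matrix:
dim(AssociatedPress)  # 2246 documents, 10473 terms
dim(ap_dtm)           # 2246 documents, 10134 terms after dropping stop words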
Remember that for LDA, you need to specify in advance the number of topics in the underlying topic structure.
Let’s estimate an LDA model for the Associated Press articles, setting k = 4:
ap_lda <- LDA(ap_dtm, k = 4, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 4 topics.
What do the top terms for each of these topics look like? Tidying the model extracts the per-topic-per-word probabilities, called beta, so we can pull out the most probable terms in each topic.
ap_lda_td <- tidy(ap_lda)
top_terms <- ap_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%    # keep the five most probable terms per topic
  ungroup() %>%
  arrange(topic, -beta)
top_terms
## # A tibble: 20 x 3
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 court      0.006218491
##  2     1 people     0.004889242
##  3     1 time       0.003290945
##  4     1 attorney   0.003153044
##  5     1 judge      0.003048817
##  6     2 percent    0.019680944
##  7     2 million    0.013124415
##  8     2 billion    0.009420669
##  9     2 market     0.007138427
## 10     2 company    0.006193477
## 11     3 president  0.009534250
## 12     3 soviet     0.009429408
## 13     3 bush       0.009127182
## 14     3 government 0.006440969
## 15     3 party      0.005518307
## 16     4 police     0.009129833
## 17     4 people     0.007842503
## 18     4 officials  0.005275858
## 19     4 government 0.004891185
## 20     4 military   0.004351113
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 2) +
coord_flip()
Fair enough. The four topics generally look to describe:

1. The judicial system (court, attorney, judge)
2. Business and the economy (percent, million, billion, market)
3. National politics and international affairs (president, soviet, bush, party)
4. Police and military events (police, officials, government, military)
What happens if we set k = 12? How do our results change?
ap_lda <- LDA(ap_dtm, k = 12, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 12 topics.
ap_lda_td <- tidy(ap_lda)
top_terms <- ap_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 60 x 3
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 court      0.010589315
##  2     1 workers    0.010367812
##  3     1 federal    0.007757856
##  4     1 union      0.006122602
##  5     1 department 0.006025243
##  6     2 percent    0.029968251
##  7     2 market     0.013703597
##  8     2 prices     0.012567295
##  9     2 million    0.010915256
## 10     2 dollar     0.008528528
## # ... with 50 more rows
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
These topics appear to be more specific, yet not as easily decodable. Our best guesses at what they describe:

- Iraq War (I)
- Bush’s reelection campaign
- Federal courts
- Apartheid and South Africa
- Crime
- Economy
- No idea
- Soviet Union
- Environment
- Stock market
- Wildfires?
- Bush-Congress relations
As you can see, several different values of k may be plausible, but by increasing k we sacrifice the clarity of the topics.
Perplexity is a statistical measure of how well a probability model predicts a sample. Its benefit comes in comparing perplexity across models fit with different values of k: the model with the lowest perplexity is generally considered the “best”.
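Concretely, perplexity is (roughly) the exponentiated negative average log-likelihood per token,

$$\text{Perplexity}(w) = \exp\left(-\frac{\log p(w)}{N}\right)$$

where $p(w)$ is the likelihood of the observed tokens under the fitted model and $N$ is the total number of tokens; lower values mean the model assigns higher probability to the words we actually observe.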
perplexity(ap_lda)
## [1] 2272.392
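One caveat: perplexity computed on the same documents used to fit the model can favor more complex models. For a more honest estimate, perplexity() also accepts held-out documents via its newdata argument. A minimal sketch, assuming an arbitrary random 80/20 split of ap_dtm (the seed and split ratio are illustrative, not from the original analysis):
set.seed(123)  # illustrative seed for the split
train_idx <- sample(nrow(ap_dtm), floor(0.8 * nrow(ap_dtm)))
ap_train <- ap_dtm[train_idx, ]   # documents used to fit the model
ap_test  <- ap_dtm[-train_idx, ]  # held-out documents
ap_lda_train <- LDA(ap_train, k = 4, control = list(seed = 11091987))
perplexity(ap_lda_train, newdata = ap_test)  # lower is better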
n_topics <- c(2, 4, 10, 20, 50, 100)
# fit one LDA model per value of k; this can take several minutes for large k
ap_lda_compare <- n_topics %>%
  map(LDA, x = ap_dtm, control = list(seed = 1109))
data_frame(k = n_topics,
perplex = map_dbl(ap_lda_compare, perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")
It looks like the 100-topic model has the lowest perplexity score. What kind of topics does this generate? Let’s look just at the first 12 topics produced by the model (ggplot2 has difficulty rendering a graph for 100 separate facets):
ap_lda_td <- tidy(ap_lda_compare[[6]])  # the sixth element is the k = 100 model
top_terms <- ap_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 504 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 eastern   0.01722148
##  2     1 contract  0.01231454
##  3     1 pilots    0.01196664
##  4     1 court     0.01174533
##  5     1 union     0.01089688
##  6     2 soviet    0.06517539
##  7     2 union     0.01963833
##  8     2 bush      0.01639825
##  9     2 united    0.01381904
## 10     2 gorbachev 0.01347580
## # ... with 494 more rows
top_terms %>%
filter(topic <= 12) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
This is where your intuition and domain knowledge as a researcher are important. Perplexity tells you which model best predicts the observed words, but it cannot tell you whether the topics it produces are substantively meaningful, so treat it as one data point alongside your own reading of the topics.