Frequently when using LDA, you don’t actually know the underlying topic structure of the documents; that is generally why you are using LDA to analyze the text in the first place. LDA is still useful in these instances, but we have to perform additional tests and analyses to confirm that the topic structure it uncovers is a good one.
The topicmodels package includes a document-term matrix of a sample of articles published by the Associated Press in 1992. Let’s load them into R and convert them to a tidy format.
library(tidyverse)
library(tidytext)
library(topicmodels)
library(stringr)
data("AssociatedPress", package = "topicmodels")
ap_td <- tidy(AssociatedPress)
ap_td
## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # ... with 302,021 more rows
AssociatedPress is already stored as a document-term matrix, exactly the format we need for topic modeling. Why tidy it first? Because the original DTM contains stop words, and we want to remove them before modeling the data. Let’s remove the stop words, then cast the data back into a document-term matrix.
ap_dtm <- ap_td %>%
  anti_join(stop_words, by = c(term = "word")) %>%  # drop rows whose term is a stop word
  cast_dtm(document, term, count)                   # back to a DTM for topicmodels
ap_dtm
## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
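As a quick sanity check (not part of the original analysis), we can compare the vocabulary before and after removing stop words; dim() works directly on a document-term matrix:
dim(AssociatedPress)  # 2246 documents, 10473 terms
dim(ap_dtm)           # 2246 documents, 10134 terms after dropping stop words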
Remember that for LDA, you need to specify in advance the number of topics in the underlying topic structure.
Let’s estimate an LDA model for the Associated Press articles, setting k = 4:
ap_lda <- LDA(ap_dtm, k = 4, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 4 topics.
What do the top terms for each of these topics look like? Tidying the model extracts the per-topic-per-word probabilities, called beta, so we can pull out the most probable terms in each topic.
ap_lda_td <- tidy(ap_lda)
top_terms <- ap_lda_td %>%
  group_by(topic) %>%
  top_n(5, beta) %>%    # keep the five most probable terms per topic
  ungroup() %>%
  arrange(topic, -beta)
top_terms
## # A tibble: 20 x 3
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 court      0.006218491
##  2     1 people     0.004889242
##  3     1 time       0.003290945
##  4     1 attorney   0.003153044
##  5     1 judge      0.003048817
##  6     2 percent    0.019680944
##  7     2 million    0.013124415
##  8     2 billion    0.009420669
##  9     2 market     0.007138427
## 10     2 company    0.006193477
## 11     3 president  0.009534250
## 12     3 soviet     0.009429408
## 13     3 bush       0.009127182
## 14     3 government 0.006440969
## 15     3 party      0.005518307
## 16     4 police     0.009129833
## 17     4 people     0.007842503
## 18     4 officials  0.005275858
## 19     4 government 0.004891185
## 20     4 military   0.004351113
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 2) +
coord_flip()
Fair enough. The four topics generally look to describe:

1. The judicial system (court, attorney, judge)
2. Business and the economy (percent, million, billion, market)
3. National politics and international affairs (president, soviet, bush, party)
4. Police and military events (police, officials, government, military)
What happens if we set k = 12? How do our results change?
ap_lda <- LDA(ap_dtm, k = 12, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 12 topics.
ap_lda_td <- tidy(ap_lda)
top_terms <- ap_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 60 x 3
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 court      0.010589315
##  2     1 workers    0.010367812
##  3     1 federal    0.007757856
##  4     1 union      0.006122602
##  5     1 department 0.006025243
##  6     2 percent    0.029968251
##  7     2 market     0.013703597
##  8     2 prices     0.012567295
##  9     2 million    0.010915256
## 10     2 dollar     0.008528528
## # ... with 50 more rows
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
These topics appear to be more specific, yet not as easily decodable. Our best guesses at what they describe:

- Iraq War (I)
- Bush’s reelection campaign
- Federal courts
- Apartheid and South Africa
- Crime
- Economy
- No idea
- Soviet Union
- Environment
- Stock market
- Wildfires?
- Bush-Congress relations
As you can see, several different values of k may be plausible, but by increasing k we sacrifice the clarity of the topics.
Perplexity is a statistical measure of how well a probability model predicts a sample. Its benefit comes in comparing perplexity across models fit with different values of k: the model with the lowest perplexity is generally considered the “best”.
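Concretely, perplexity is (roughly) the exponentiated negative average log-likelihood per token,

$$\text{Perplexity}(w) = \exp\left(-\frac{\log p(w)}{N}\right)$$

where $p(w)$ is the likelihood of the observed tokens under the fitted model and $N$ is the total number of tokens; lower values mean the model assigns higher probability to the words we actually observe.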
perplexity(ap_lda)
## [1] 2272.392
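One caveat: perplexity computed on the same documents used to fit the model can favor more complex models. For a more honest estimate, perplexity() also accepts held-out documents via its newdata argument. A minimal sketch, assuming an arbitrary random 80/20 split of ap_dtm (the seed and split ratio are illustrative, not from the original analysis):
set.seed(123)  # illustrative seed for the split
train_idx <- sample(nrow(ap_dtm), floor(0.8 * nrow(ap_dtm)))
ap_train <- ap_dtm[train_idx, ]   # documents used to fit the model
ap_test  <- ap_dtm[-train_idx, ]  # held-out documents
ap_lda_train <- LDA(ap_train, k = 4, control = list(seed = 11091987))
perplexity(ap_lda_train, newdata = ap_test)  # lower is better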
n_topics <- c(2, 4, 10, 20, 50, 100)
# fit one LDA model per value of k; this can take several minutes for large k
ap_lda_compare <- n_topics %>%
  map(LDA, x = ap_dtm, control = list(seed = 1109))
data_frame(k = n_topics,
perplex = map_dbl(ap_lda_compare, perplexity)) %>%
ggplot(aes(k, perplex)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")
It looks like the 100-topic model has the lowest perplexity score. What kind of topics does this generate? Let’s look just at the first 12 topics produced by the model (ggplot2 has difficulty rendering a graph for 100 separate facets):
ap_lda_td <- tidy(ap_lda_compare[[6]])  # the sixth element is the k = 100 model
top_terms <- ap_lda_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 504 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 eastern   0.01722148
##  2     1 contract  0.01231454
##  3     1 pilots    0.01196664
##  4     1 court     0.01174533
##  5     1 union     0.01089688
##  6     2 soviet    0.06517539
##  7     2 union     0.01963833
##  8     2 bush      0.01639825
##  9     2 united    0.01381904
## 10     2 gorbachev 0.01347580
## # ... with 494 more rows
top_terms %>%
filter(topic <= 12) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
coord_flip()
This is where your intuition and domain knowledge as a researcher are important. Perplexity tells you which model best predicts the observed words, but it cannot tell you whether the topics it produces are substantively meaningful, so treat it as one data point alongside your own reading of the topics.