library(tm)
## Loading required package: NLP
reviews = read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/movie_reviews.csv", stringsAsFactors = F)
review_corpus = Corpus(VectorSource(reviews$review))
review_corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 50000
library(tm)
data("acq")
# Description:
# This dataset holds 50 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic acq, dealing with corporate acquisitions.
acq
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 50
inspect(acq[1:2]) # metadata
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## $`reut-00001.xml`
## <<PlainTextDocument>>
## Metadata: 15
## Content: chars: 1287
##
## $`reut-00002.xml`
## <<PlainTextDocument>>
## Metadata: 15
## Content: chars: 784
inspect(acq[[1]]) # details
## <<PlainTextDocument>>
## Metadata: 15
## Content: chars: 1287
##
## Computer Terminal Systems Inc said
## it has completed the sale of 200,000 shares of its common
## stock, and warrants to acquire an additional one mln shares, to
## <Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.
## The company said the warrants are exercisable for five
## years at a purchase price of .125 dlrs per share.
## Computer Terminal said Sedio also has the right to buy
## additional shares and increase its total holdings up to 40 pct
## of the Computer Terminal's outstanding common stock under
## certain circumstances involving change of control at the
## company.
## The company said if the conditions occur the warrants would
## be exercisable at a price equal to 75 pct of its common stock's
## market price at the time, not to exceed 1.50 dlrs per share.
## Computer Terminal also said it sold the technolgy rights to
## its Dot Matrix impact technology, including any future
## improvements, to <Woodco Inc> of Houston, Tex. for 200,000
## dlrs. But, it said it would continue to be the exclusive
## worldwide licensee of the technology for Woodco.
## The company said the moves were part of its reorganization
## plan and would help pay current operation costs and ensure
## product delivery.
## Computer Terminal makes computer generated labels, forms,
## tags and ticket printers and terminals.
## Reuter
# Vs.
inspect(review_corpus[3:4]) # metadata
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.
## [2] Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.
inspect(review_corpus[[1]]) # details
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1761
##
## One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.
This comes from: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
This vignette gives a short introduction to text mining in R utilizing the text mining framework provided by the tm package. We present methods for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining in R—an in-depth description of the text mining infrastructure offered by tm was published in the Journal of Statistical Software (Feinerer et al., 2008). An introductory article on text mining in R was published in R News (Feinerer, 2008).
Extra whitespace is eliminated by:
review_corpus <- tm_map(review_corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(review_corpus, stripWhitespace): transformation
## drops documents
inspect(review_corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.
## [2] A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.
Conversion to lower case is done by:
review_corpus <- tm_map(review_corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(review_corpus, content_transformer(tolower)):
## transformation drops documents
inspect(review_corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. forget pretty pictures painted for mainstream audiences, forget charm, forget romance...oz doesn't mess around. the first episode i ever saw struck me as so nasty it was surreal, i couldn't say i was ready for it, but as i watched more, i developed a taste for oz, and got accustomed to the high levels of graphic violence. not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) watching oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.
## [2] a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.
We can use arbitrary character processing functions as transformations as long as the function returns a text document. In this case we use content_transformer(), which provides a convenience wrapper to access and set the content of a document. Consequently, most text manipulation functions from base R can be used directly with this wrapper. This works for tolower() as used here, but also, e.g., for gsub(), which comes in quite handy for a broad range of text manipulation tasks.
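For instance, a minimal sketch (not part of the pipeline above; strip_br and review_corpus_nobr are hypothetical names) that routes gsub() through content_transformer() to strip the literal <br /> tags visible in the reviews:
# Sketch only: remove the "<br />" HTML line breaks seen in the raw reviews,
# writing to a new object so the outputs shown above are unaffected.
strip_br <- content_transformer(function(x) gsub("<br />", " ", x, fixed = TRUE))
review_corpus_nobr <- tm_map(review_corpus, strip_br)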
Stopwords are removed by:
review_corpus <- tm_map(review_corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(review_corpus, removeWords,
## stopwords("english")): transformation drops documents
inspect(review_corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] one reviewers mentioned watching just 1 oz episode hooked. right, exactly happened .<br /><br /> first thing struck oz brutality unflinching scenes violence, set right word go. trust , show faint hearted timid. show pulls punches regards drugs, sex violence. hardcore, classic use word.<br /><br /> called oz nickname given oswald maximum security state penitentary. focuses mainly emerald city, experimental section prison cells glass fronts face inwards, privacy high agenda. em city home many..aryans, muslims, gangstas, latinos, christians, italians, irish .... scuffles, death stares, dodgy dealings shady agreements never far away.<br /><br /> say main appeal show due fact goes shows dare. forget pretty pictures painted mainstream audiences, forget charm, forget romance...oz mess around. first episode ever saw struck nasty surreal, say ready , watched , developed taste oz, got accustomed high levels graphic violence. just violence, injustice (crooked guards 'll sold nickel, inmates 'll kill order get away , well mannered, middle class inmates turned prison bitches due lack street skills prison experience) watching oz, may become comfortable uncomfortable viewing....thats can get touch darker side.
## [2] wonderful little production. <br /><br /> filming technique unassuming- old-time-bbc fashion gives comforting, sometimes discomforting, sense realism entire piece. <br /><br /> actors extremely well chosen- michael sheen " got polari" voices pat ! can truly see seamless editing guided references williams' diary entries, well worth watching terrificly written performed piece. masterful production one great master's comedy life. <br /><br /> realism really comes home little things: fantasy guard , rather use traditional 'dream' techniques remains solid disappears. plays knowledge senses, particularly scenes concerning orton halliwell sets (particularly flat halliwell's murals decorating every surface) terribly well done.
Stemming is done by:
review_corpus <- tm_map(review_corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(review_corpus, stemDocument): transformation
## drops documents
inspect(review_corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] one review mention watch just 1 oz episod hooked. right, exact happen .<br /><br /> first thing struck oz brutal unflinch scene violence, set right word go. trust , show faint heart timid. show pull punch regard drugs, sex violence. hardcore, classic use word.<br /><br /> call oz nicknam given oswald maximum secur state penitentary. focus main emerald city, experiment section prison cell glass front face inwards, privaci high agenda. em citi home many..aryans, muslims, gangstas, latinos, christians, italians, irish .... scuffles, death stares, dodgi deal shadi agreement never far away.<br /><br /> say main appeal show due fact goe show dare. forget pretti pictur paint mainstream audiences, forget charm, forget romance...oz mess around. first episod ever saw struck nasti surreal, say readi , watch , develop tast oz, got accustom high level graphic violence. just violence, injustic (crook guard ll sold nickel, inmat ll kill order get away , well mannered, middl class inmat turn prison bitch due lack street skill prison experience) watch oz, may becom comfort uncomfort viewing....that can get touch darker side.
## [2] wonder littl production. <br /><br /> film techniqu unassuming- old-time-bbc fashion give comforting, sometim discomforting, sens realism entir piece. <br /><br /> actor extrem well chosen- michael sheen " got polari" voic pat ! can truli see seamless edit guid refer william diari entries, well worth watch terrif written perform piece. master product one great master comedi life. <br /><br /> realism realli come home littl things: fantasi guard , rather use tradit dream techniqu remain solid disappears. play knowledg senses, particular scene concern orton halliwel set (particular flat halliwel mural decor everi surface) terribl well done.
review_dtm <- DocumentTermMatrix(review_corpus)
review_dtm
## <<DocumentTermMatrix (documents: 50000, terms: 97110)>>
## Non-/sparse entries: 4751792/4850748208
## Sparsity : 100%
## Maximal term length: 72
## Weighting : term frequency (tf)
review_dtm = removeSparseTerms(review_dtm, 0.90) # keep only terms that appear in at least ~10% of the reviews
review_dtm
## <<DocumentTermMatrix (documents: 50000, terms: 124)>>
## Non-/sparse entries: 1187272/5012728
## Sparsity : 81%
## Maximal term length: 8
## Weighting : term frequency (tf)
inspect(review_dtm[1,1:20])
## <<DocumentTermMatrix (documents: 1, terms: 20)>>
## Non-/sparse entries: 20/0
## Sparsity : 0%
## Maximal term length: 6
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs around can ever first get high just right say show
## 1 1 1 1 2 2 2 2 2 2 4
library(wordcloud)
## Loading required package: RColorBrewer
findFreqTerms(review_dtm, 1000)
## [1] "around" "can" "ever" "fact" "far" "first"
## [7] "get" "got" "high" "just" "may" "never"
## [13] "one" "pretti" "right" "saw" "say" "scene"
## [19] "set" "show" "thing" "turn" "use" "watch"
## [25] "well" "actor" "come" "done" "everi" "film"
## [31] "give" "great" "life" "littl" "old" "perform"
## [37] "play" "realli" "see" "time" "wonder" "charact"
## [43] "even" "interest" "love" "mani" "plot" "point"
## [49] "still" "thought" "way" "year" "young" "like"
## [55] "make" "movi" "movie" "must" "real" "think"
## [61] "act" "anoth" "best" "big" "cast" "director"
## [67] "find" "good" "know" "live" "look" "new"
## [73] "peopl" "seem" "tell" "work" "world" "believ"
## [79] "last" "seen" "stori" "back" "need" "quit"
## [85] "sure" "almost" "also" "bad" "hard" "made"
## [91] "now" "star" "bit" "end" "least" "will"
## [97] "actual" "better" "someth" "tri" "much" "story"
## [103] "enjoy" "feel" "long" "start" "take" "whole"
## [109] "man" "lot" "want" "enough" "script" "guy"
## [115] "direct" "music" "part" "role" "though" "noth"
## [121] "put" "two" "day" "without"
freq = data.frame(sort(colSums(as.matrix(review_dtm)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))
## Warning in brewer.pal(1, "Dark2"): minimal value for n is 3, returning requested palette with 3 different levels
Content coming from: https://www.tidytextmining.com/dtm.html
One of the most common structures that text mining packages work with is the document-term matrix (or DTM). This is a matrix where each row represents one document, each column represents one term, and each value (typically) contains the number of appearances of that term in that document.
Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns), but are stored in a more efficient format. We’ll discuss several implementations of these matrices in this chapter.
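For instance, with the review_dtm built above (a quick sketch, not from the source text):
# Sketch: a DTM can be indexed like a matrix even though it is stored sparsely.
dim(review_dtm)                    # documents x terms
as.matrix(review_dtm[1:2, 1:5])    # densify a small slice for viewing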
DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages. Thus, the tidytext package provides two verbs that convert between the two formats.
As shown in Figure 5.1, a DTM is typically comparable to a tidy data frame after a count or a group_by/summarize that contains counts or another statistic for each combination of a term and document.
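A minimal sketch of those two verbs, tidy() and cast_dtm(), applied to the review_dtm from above (the object names are ours):
# Sketch: convert the DTM to a tidy one-term-per-document-per-row table,
# then cast that table back into a DocumentTermMatrix.
library(tidytext)
review_tidy <- tidy(review_dtm)    # columns: document, term, count
review_dtm2 <- cast_dtm(review_tidy, document, term, count)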
help(tm)
## No documentation for 'tm' in specified packages and libraries:
## you could try '??tm'
Examples coming from: https://www.tidytextmining.com/topicmodeling.html
In text mining, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
As Figure 6.1 shows, we can use tidy text principles to approach topic modeling with the same set of tidy tools we’ve used throughout this book. In this chapter, we’ll learn to work with LDA objects from the topicmodels package, particularly tidying such models so that they can be manipulated with ggplot2 and dplyr. We’ll also explore an example of clustering chapters from several books, where we can see that a topic model “learns” to tell the difference between the four books based on the text content.
Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.
Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. There are a number of existing implementations of this algorithm, and we’ll explore one of them in depth.
In Chapter 5 we briefly introduced the AssociatedPress dataset provided by the topicmodels package, as an example of a DocumentTermMatrix. This is a collection of 2246 news articles from an American news agency, mostly published around 1988.
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.5.2
data("AssociatedPress")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
We can use the LDA() function from the topicmodels package, setting k = 2, to create a two-topic LDA model.
Almost any topic model in practice will use a larger k, but we will soon see that this analysis approach extends to a larger number of topics.
This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.
# set a seed so that the output of the model is predictable
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 2 topics.
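As noted above, a real analysis would typically use a larger k; the call is the same, for example (a sketch with a hypothetical k = 4, not used in the rest of this walkthrough):
# Sketch only: the same call with a larger, hypothetical number of topics.
ap_lda_4 <- LDA(AssociatedPress, k = 4, control = list(seed = 1234))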
Fitting the model was the “easy part”: the rest of the analysis will involve exploring and interpreting the model using tidying functions from the tidytext package.
In Chapter 5 we introduced the tidy() method, originally from the broom package (Robinson 2017), for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model.
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.5.2
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 20,946 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 aaron 1.69e-12
## 2 2 aaron 3.90e- 5
## 3 1 abandon 2.65e- 5
## 4 2 abandon 3.99e- 5
## 5 1 abandoned 1.39e- 4
## 6 2 abandoned 5.88e- 5
## 7 1 abandoning 2.45e-33
## 8 2 abandoning 2.34e- 5
## 9 1 abbott 2.13e- 6
## 10 2 abbott 2.97e- 5
## # … with 20,936 more rows
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "aaron" has a 1.686917 × 10^-12 probability of being generated from topic 1, but a 3.8959408 × 10^-5 probability of being generated from topic 2.
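As a quick check (a sketch using base-R subsetting on the tidy output), we can pull out both per-topic rows for a single term:
# Sketch: both per-topic probabilities for the term "aaron".
ap_topics[ap_topics$term == "aaron", ]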
We could use dplyr’s top_n() to find the 10 terms that are most common within each topic. As a tidy data frame, this lends itself well to a ggplot2 visualization (Figure 6.2).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
This visualization lets us understand the two topics that were extracted from the articles. The most common words in topic 1 include “percent”, “million”, “billion”, and “company”, which suggests it may represent business or financial news. Those most common in topic 2 include “president”, “government”, and “soviet”, suggesting that this topic represents political news. One important observation about the words in each topic is that some words, such as “new” and “people”, are common within both topics. This is an advantage of topic modeling as opposed to “hard clustering” methods: topics used in natural language could have some overlap in terms of words.
As an alternative, we could consider the terms that had the greatest difference in β between topic 1 and topic 2. This can be estimated based on the log ratio of the two: log2(β2 / β1) (a log ratio is useful because it makes the difference symmetrical: β2 being twice as large leads to a log ratio of 1, while β1 being twice as large results in -1). To constrain it to a set of especially relevant words, we can filter for relatively common words, such as those that have a β greater than 1/1000 in at least one topic.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.2
beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))
beta_spread
## # A tibble: 198 x 4
## term topic1 topic2 log_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 administration 0.000431 0.00138 1.68
## 2 ago 0.00107 0.000842 -0.339
## 3 agreement 0.000671 0.00104 0.630
## 4 aid 0.0000476 0.00105 4.46
## 5 air 0.00214 0.000297 -2.85
## 6 american 0.00203 0.00168 -0.270
## 7 analysts 0.00109 0.000000578 -10.9
## 8 area 0.00137 0.000231 -2.57
## 9 army 0.000262 0.00105 2.00
## 10 asked 0.000189 0.00156 3.05
## # … with 188 more rows
beta_top_terms <- beta_spread %>%
  top_n(10, log_ratio) %>%
  arrange(log_ratio)
ggplot(beta_top_terms, aes(term, log_ratio)) + geom_bar(stat = "identity") + coord_flip()
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called γ (“gamma”), with the matrix = “gamma” argument to tidy().
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 4,492 x 3
## document topic gamma
## <int> <int> <dbl>
## 1 1 1 0.248
## 2 2 1 0.362
## 3 3 1 0.527
## 4 4 1 0.357
## 5 5 1 0.181
## 6 6 1 0.000588
## 7 7 1 0.773
## 8 8 1 0.00445
## 9 9 1 0.967
## 10 10 1 0.147
## # … with 4,482 more rows
Each of these values is an estimated proportion of words from that document that are generated from that topic. For example, the model estimates that only about 25% of the words in document 1 were generated from topic 1.
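As a quick check (a sketch using base-R subsetting), the two per-topic proportions for document 1 sum to one:
# Sketch: both topic proportions for document 1.
ap_documents[ap_documents$document == 1, ]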
We can see that many of these documents were drawn from a mix of the two topics, but that document 6 was drawn almost entirely from topic 2, having a γ from topic 1 close to zero. To check this answer, we could tidy() the document-term matrix (see Chapter 5.1) and check what the most common words in that document were.
tidy(AssociatedPress) %>%
  filter(document == 6) %>%
  arrange(desc(count))
## # A tibble: 287 x 3
## document term count
## <int> <chr> <dbl>
## 1 6 noriega 16
## 2 6 panama 12
## 3 6 jackson 6
## 4 6 powell 6
## 5 6 administration 5
## 6 6 economic 5
## 7 6 general 5
## 8 6 i 5
## 9 6 panamanian 5
## 10 6 american 4
## # … with 277 more rows
Based on the most common words, this appears to be an article about the relationship between the American government and Panamanian dictator Manuel Noriega, which means the algorithm was right to place it in topic 2 (as political/national news).