LDA

Packages we will use

  1. Core packages

The main package that performs LDA is topicmodels. We will also use the ldatuning package to help find the number of topics (k) for LDA.

#install.packages("topicmodels")
library(topicmodels)

#install.packages("ldatuning")
library(ldatuning)
  2. Packages to preprocess the data
  • Refresher: What is a corpus? What is a document? What is a term?

  • Refresher: The three matrices that the LDA algorithm works on to produce the results are

    • document-term matrix (the focus of the preprocessing)

    • document-topic matrix

    • topic-term matrix

  • In preprocessing, we need to:

    • tokenize the terms from the documents

    • clean the data and conduct feature selection (Question: What does feature selection mean?)

    • construct the document-term matrix

  • We’ve used tidytext before - it is the package we will use for tokenization and for constructing the document-term matrix

  • We will also use dplyr and tidyr (both part of the tidyverse) to transform data frames

  • We will use SnowballC to implement Porter’s word stemming

library(tidytext)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

#install.packages("SnowballC")
library(SnowballC)
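
As a quick aside (not part of the pipeline), SnowballC's wordStem() is the function we will later call inside mutate() to stem each token. A minimal illustration:

wordStem(c("running", "parties", "friends", "monkeys")) # should stem to: run, parti, friend, monkei
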
  3. Package for visualization

We will use ggplot2 for visualization. It is also part of the tidyverse and is the most widely used visualization tool in R. We can use it to plot all types of data, not just topic modeling results.

#install.packages("ggplot2")
library(ggplot2)

Finally, a trick to make life easier. Do you remember what the code below does?

options(scipen = 100)
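
(For reference: scipen sets a penalty against scientific notation, so with a large value such as 100, R prints small numbers as plain decimals rather than, say, 1e-07. This will make the topic-term probabilities easier to read later.) A minimal illustration:

print(0.0000001) # with the default scipen = 0, this prints as 1e-07
options(scipen = 100)
print(0.0000001) # now prints as 0.0000001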

Preprocessing

Today we will use a dataset available in an R package, rather than using read.csv to load our own data. In tutorials and textbooks, you often see these “built-in” datasets used to demonstrate code, mainly because they are convenient and available to everyone.

The dataset we are using today contains the transcripts and metadata for the TV show Friends.

#install.packages("friends")
library(friends)
str(friends)
## tibble [67,373 × 6] (S3: tbl_df/tbl/data.frame)
##  $ text     : chr [1:67373] "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
##  $ speaker  : chr [1:67373] "Monica Geller" "Joey Tribbiani" "Chandler Bing" "Phoebe Buffay" ...
##  $ season   : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
##  $ episode  : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
##  $ scene    : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
##  $ utterance: int [1:67373] 1 2 3 4 5 6 7 8 9 10 ...

First, let’s combine lines from the same episode to get the data in the structure we want (we are not changing the actual content of the data in this step):

data <- friends %>% 
  mutate(season = paste("s", season, sep = ""), 
         id = paste(season, episode, sep = "_e")) %>% # mutate() is for creating, modifying, or deleting columns
         group_by(id, season) %>% # group_by() function alone will not give any output; it should be followed by functions such as summarise() to perform an action by group
         summarise(transcript = paste(text, collapse = ' '))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
head(data)
## # A tibble: 6 × 3
## # Groups:   id [6]
##   id      season transcript                                                     
##   <chr>   <chr>  <chr>                                                          
## 1 s10_e1  s10    "[Scene: Barbados, Monica and Chandler's Room. They both enter…
## 2 s10_e10 s10    "[Scene: Central Perk. Everybody's sitting on the couch and Mo…
## 3 s10_e11 s10    "[Scene: Central Perk. Everyone's sitting on the couch. Monica…
## 4 s10_e12 s10    "[Scene: Central Perk. Joey's sitting on the couch and Phoebe …
## 5 s10_e13 s10    "[Scene: Central Perk. Everyone's sitting on the couch and Pho…
## 6 s10_e14 s10    "[Scene: Chandler and Monica's apartment. They are having a di…

Now we begin preprocessing.

The code in the chunk below, which tokenizes the text and removes built-in stop words, is something we’ve done before:

data %>%
  unnest_tokens(word, transcript) %>% # tokenize the corpus
  anti_join(stop_words, by = "word") %>% # remove the stop words
  count(word, sort = TRUE) # word frequency
## # A tibble: 107,113 × 3
## # Groups:   id [236]
##    id     word       n
##    <chr>  <chr>  <int>
##  1 s4_e24 ross      68
##  2 s4_e21 ross      54
##  3 s7_e21 monica    52
##  4 s6_e9  ross      51
##  5 s9_e9  ross      51
##  6 s4_e20 yeah      49
##  7 s4_e24 emily     47
##  8 s6_e23 ring      47
##  9 s8_e19 joey      46
## 10 s1_e19 marcel    45
## # ℹ 107,103 more rows

Constructing a custom stop words list containing interjections and filler words:

custom_stop <- data.frame(
  word =c("yeah","hey","gonna","scene","uh","y'know","umm","huh","wanna","wow","ah","um","ya","ooh","alright","ohh","bye","whoa","kinda","ha","uhh","ow","uhm","hmm","ahh","wh","em","ugh"))

Now we do proper cleaning:

clean_data <- data %>%
  unnest_tokens(word, transcript) %>% 
  anti_join(stop_words, by = "word") %>% 
  anti_join(custom_stop, by = "word") %>% # further get rid of custom stop words
  mutate(stem = wordStem(word)) %>% # stemming
  count(id, stem) %>% # count word stems in each transcript
  arrange(-n) %>% # order the data by count
  mutate(season = gsub("(.*)_.*", id, replacement = "\\1")) %>% # reattach season (won't use it for LDA, will use for STM)
  bind_tf_idf(id, stem, n) # compute tf-idf score

summary(clean_data$tf_idf)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001024 0.027382 0.077582 0.449639 0.318593 4.037025
hist(clean_data$tf_idf)

clean_data <- clean_data %>%
  filter(tf_idf > 0.027382) # keep only terms with tf-idf above the first quartile (from the summary above)
  
dtm <- clean_data %>%
  cast_dtm(id, stem, n) # finally, we can create a document-term matrix
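
Before moving on, it is worth a quick look at the matrix we just built. Printing a DocumentTermMatrix shows its dimensions and sparsity; based on the model output later in this tutorial, it should have 236 documents and 12,408 word stems:

dtm # prints the number of documents, number of terms, and sparsity
dim(dtm) # number of documents and number of retained word stems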

Set parameters (k, alpha, etc.)

  1. Find the number of topics

Here we use several indices to find the “appropriate” number of topics in a statistical sense. But as we said last week, this should be an iterative process in which you inspect the LDA results produced by several different k values and select the one that is most useful for answering your substantive question.

The ldatuning package provides four metrics developed in past research (Griffiths & Steyvers 2004; Cao et al. 2009; Arun et al. 2010; Deveaud et al. 2014).

results <- FindTopicsNumber(
  dtm, # document term matrix
  topics = seq(from = 3, to = 20, by = 1), # range of topics
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), # four metrics
  method = "Gibbs", # Gibbs is a sampling method
  control = list(seed = 77), # set a seed for replicable results
  verbose = TRUE # if FALSE (default), suppress all warnings and additional information
) # this should take a minute to run with the current data
## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.
##   Deveaud2014... done.
FindTopicsNumber_plot(results)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
##   Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We can also check perplexity scores.

  • The perplexity function in topicmodels generates the perplexity score of a fitted LDA model (lower is better)

  • We can estimate three LDA models for three different k values, and generate perplexity scores for each of the three models.

  • We’ve used lapply(X, FUN) before: it applies a function (FUN) to each element of X and returns the results as a list.

k_values <- c(9, 10, 11)

lda_compare <- k_values %>% 
  lapply(x = dtm, LDA, control = list(seed = 1109)) # estimate three models; because dtm is passed as the named argument x, each value in k_values fills LDA's k argument
# this should take a minute to run with the current data

p <- data_frame(k_values = k_values, perplex_scores = as.numeric(lapply(lda_compare, perplexity))) # estimate perplexity scores and build a data frame for plotting
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p
## # A tibble: 3 × 2
##   k_values perplex_scores
##      <dbl>          <dbl>
## 1        9          1896.
## 2       10          1826.
## 3       11          1765.

We can visualize this using ggplot (note that + rather than %>% connects the different commands under ggplot):

ggplot(p, aes(k_values, perplex_scores)) +
  geom_point() +
  geom_line() +
  labs(title = "Evaluating LDA topic models",
       subtitle = "Optimal number of topics (smaller is better)",
       x = "Number of topics",
       y = "Perplexity")

  2. Define hyperparameters

We will need to define the hyperparameter characterizing the prior document-topic distribution (alpha in topicmodels) and the hyperparameter characterizing the prior topic-term distribution (delta in topicmodels).

It is good practice to set these in a separate line of code before running the model.

k <- 11

control_LDA_Gibbs <- list(
  
  alpha = 50/k, # the starting value for alpha is 50/k suggested by Griffiths & Steyvers (2004)
  delta = 0.1, # the starting value for the hyperparameter characterizing the prior topic-term distribution is 0.1 suggested by Griffiths & Steyvers (2004)
  estimate.beta = TRUE, # estimate the posterior topic-term distribution
  
  verbose = 500, # if 0 (default), no output is generated during model fitting; if a positive integer, progress is reported every verbose iterations
  save = 0,  # if 0 (default), no intermediate results are saved in files; if a positive integer, every save iterations intermediate results are saved
  keep = 0, # if a positive integer, the log-likelihood is saved every keep iterations
  prefix = tempfile(), # path indicating where to save the intermediate results
  
  seed = 999, # random seed for reproducibility
  nstart = 1, # number of repeated random starts
  best = TRUE, # returns only the best one model
  
  iter = 2000, # number of Gibbs iterations, by default equals 2000
  burnin = 0, # number of Gibbs iterations discarded, by default 0
  thin = 2000) # then every 2000th Gibbs iteration is returned, by default equal to iter

Run LDA

friends_lda <- LDA(dtm, k=k, method="Gibbs", control = control_LDA_Gibbs)
## K = 11; V = 12408; M = 236
## Sampling 2000 iterations!
## Iteration 500 ...
## Iteration 1000 ...
## Iteration 1500 ...
## Iteration 2000 ...
## Gibbs sampling completed!

LDA output

str(friends_lda)
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:133734] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ alpha          : num 4.55
##   ..@ call           : language LDA(x = dtm, k = k, method = "Gibbs", control = control_LDA_Gibbs)
##   ..@ Dim            : int [1:2] 236 12408
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   .. .. ..@ delta        : num 0.1
##   .. .. ..@ iter         : int 2000
##   .. .. ..@ thin         : int 2000
##   .. .. ..@ burnin       : int 0
##   .. .. ..@ initialize   : chr "random"
##   .. .. ..@ alpha        : num 4.55
##   .. .. ..@ seed         : int 999
##   .. .. ..@ verbose      : int 500
##   .. .. ..@ prefix       : chr "/var/folders/97/dfqcl8wx5gq6_mg2xfh_jwkc0000gn/T//RtmpuXI4UN/file1146b114a88bd"
##   .. .. ..@ save         : int 0
##   .. .. ..@ nstart       : int 1
##   .. .. ..@ best         : logi TRUE
##   .. .. ..@ keep         : int 0
##   .. .. ..@ estimate.beta: logi TRUE
##   ..@ k              : int 11
##   ..@ terms          : chr [1:12408] "ross" "gui" "monica" "ring" ...
##   ..@ documents      : chr [1:236] "s4_e24" "s6_e19" "s4_e21" "s7_e21" ...
##   ..@ beta           : num [1:11, 1:12408] -3.16 -11.82 -11.7 -11.85 -11.84 ...
##   ..@ gamma          : num [1:236, 1:11] 0.314 0.116 0.54 0.685 0.619 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:70671] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ j   : int [1:70671] 1 3 4 5 8 15 16 17 18 19 ...
##   .. ..$ v   : num [1:70671] 1 1 9 9 1 1 1 9 7 1 ...
##   .. ..$ nrow: int 236
##   .. ..$ ncol: int 12408
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -955778
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 133734
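
Besides the tidy() approach used below, you can pull the posterior matrices out of the fitted model directly with topicmodels' posterior() function, which returns a list with the topic-term probabilities ($terms) and the document-topic proportions ($topics). (Note that the @beta slot shown above stores the topic-term distribution on the log scale, while posterior() returns probabilities.) A minimal sketch:

post <- posterior(friends_lda)
dim(post$terms)  # k x V: one row per topic, one column per term
dim(post$topics) # M x k: one row per document, one column per topic
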
  1. Topic-term

Quick glance at topic-term distribution:

terms(friends_lda, 10)
##       Topic 1    Topic 2   Topic 3    Topic 4  Topic 5 Topic 6     Topic 7 
##  [1,] "ross"     "night"   "gui"      "talk"   "game"  "date"      "carol" 
##  [2,] "chandler" "hand"    "parti"    "friend" "move"  "offic"     "marcel"
##  [3,] "joei"     "monei"   "ben"      "happen" "mom"   "chandler'" "monkei"
##  [4,] "rachel"   "plai"    "birthdai" "stuff"  "throw" "enter"     "la"    
##  [5,] "monica"   "richard" "peopl"    "sit"    "joey'" "fire"      "cat"   
##  [6,] "cut"      "break"   "funni"    "couch"  "ball"  "dai"       "juli"  
##  [7,] "door"     "woman"   "kiss"     "kiss"   "smoke" "book"      "susan" 
##  [8,] "phoeb"    "bedroom" "christma" "watch"  "leav"  "class"     "life"  
##  [9,] "start"    "laugh"   "fun"      "lot"    "plai"  "read"      "woman" 
## [10,] "move"     "start"   "dad"      "perk"   "car"   "restaur"   "barri" 
##       Topic 8  Topic 9 Topic 10  Topic 11 
##  [1,] "babi"   "marri" "apart"   "call"   
##  [2,] "phoeb"  "wed"   "emma"    "rachel'"
##  [3,] "doctor" "emili" "mike"    "god"    
##  [4,] "honei"  "ring"  "love"    "joey'"  
##  [5,] "fine"   "dress" "monica'" "phone"  
##  [6,] "sex"    "love"  "feel"    "plai"   
##  [7,] "dr"     "wait"  "miss"    "hous"   
##  [8,] "father" "walk"  "charli"  "chair"  
##  [9,] "feel"   "plan"  "guess"   "line"   
## [10,] "massag" "wear"  "word"    "mark"

Visualize the topic-term distribution (labeled “beta” in topicmodels):

term_topic_matrix <- tidy(friends_lda, matrix="beta")
head(term_topic_matrix)
## # A tibble: 6 × 3
##   topic term        beta
##   <int> <chr>      <dbl>
## 1     1 ross  0.0423    
## 2     2 ross  0.00000734
## 3     3 ross  0.00000828
## 4     4 ross  0.00000716
## 5     5 ross  0.00000718
## 6     6 ross  0.00000691
top_terms <- term_topic_matrix %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
##   topic term       beta
##   <int> <chr>     <dbl>
## 1     1 ross     0.0423
## 2     1 chandler 0.0398
## 3     1 joei     0.0343
## 4     1 rachel   0.0340
## 5     1 monica   0.0276
## 6     1 cut      0.0208
ggplot(top_terms, aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
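
One caveat: with scales = "free", ggplot orders the terms in each facet alphabetically rather than by beta. If you want each panel sorted by beta, tidytext provides reorder_within() and scale_x_reordered() for exactly this purpose; a sketch:

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>% # reorder terms by beta within each topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() + # clean up the reordered axis labels
  coord_flip()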

  2. Document-topic

The document-topic distribution is labeled “gamma” in topicmodels:

topic_doc_matrix <- tidy(friends_lda, matrix="gamma")

topic_doc_wider <- topic_doc_matrix %>%
  mutate(topic = paste("topic", topic, sep = "")) %>%
  pivot_wider(names_from = topic, values_from = gamma)
#View(topic_doc_wider)

See the documents with the highest proportions of topic 2:

topic_doc_wider %>%
  arrange(-topic2)
## # A tibble: 236 × 12
##    document topic1 topic2  topic3 topic4 topic5  topic6  topic7 topic8 topic9
##    <chr>     <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
##  1 s5_e23   0.0708  0.584 0.00882 0.0236 0.0371 0.0304  0.0183  0.0115 0.153 
##  2 s5_e24   0.0155  0.527 0.0345  0.0184 0.0697 0.00960 0.0169  0.0140 0.206 
##  3 s1_e5    0.0293  0.509 0.0143  0.0594 0.0461 0.00927 0.0979  0.0611 0.0829
##  4 s7_e15   0.0173  0.493 0.0109  0.0173 0.0288 0.0619  0.0224  0.126  0.0696
##  5 s6_e17   0.0418  0.491 0.0276  0.0528 0.0702 0.116   0.0229  0.0355 0.0639
##  6 s3_e24   0.0482  0.476 0.0315  0.117  0.0594 0.0790  0.0231  0.0119 0.0399
##  7 s3_e23   0.0364  0.470 0.0154  0.0186 0.0348 0.136   0.0912  0.0412 0.0444
##  8 s3_e15   0.0444  0.454 0.0202  0.0751 0.0541 0.0751  0.0235  0.0202 0.0299
##  9 s3_e13   0.0239  0.400 0.0292  0.114  0.0609 0.206   0.0415  0.0292 0.0274
## 10 s5_e22   0.253   0.396 0.0256  0.0244 0.0993 0.0394  0.00817 0.0406 0.0319
## # ℹ 226 more rows
## # ℹ 2 more variables: topic10 <dbl>, topic11 <dbl>

The top two are:

IMDB page for episode The One with the Vows

IMDB page for episode The One Where Everybody Finds Out
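
If you want the most representative documents for every topic at once, rather than sorting by one topic at a time, a quick sketch using dplyr on the long-format gamma table:

topic_doc_matrix %>%
  group_by(topic) %>%
  slice_max(gamma, n = 2) %>% # the two documents with the highest proportion of each topic
  ungroup()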

You can also re-attach the transcripts and investigate the data outside of R:

str(data)
## gropd_df [236 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ id        : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
##  $ season    : chr [1:236] "s10" "s10" "s10" "s10" ...
##  $ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
##  - attr(*, "groups")= tibble [236 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ id   : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
##   ..$ .rows: list<int> [1:236] 
##   .. ..$ : int 1
##   .. ..$ : int 2
##   .. ..$ : int 3
##   .. ..$ : int 4
##   .. ..$ : int 5
##   .. ..$ : int 6
##   .. ..$ : int 7
##   .. ..$ : int 8
##   .. ..$ : int 9
##   .. ..$ : int 10
##   .. ..$ : int 11
##   .. ..$ : int 12
##   .. ..$ : int 13
##   .. ..$ : int 14
##   .. ..$ : int 15
##   .. ..$ : int 16
##   .. ..$ : int 17
##   .. ..$ : int 18
##   .. ..$ : int 19
##   .. ..$ : int 20
##   .. ..$ : int 21
##   .. ..$ : int 22
##   .. ..$ : int 23
##   .. ..$ : int 24
##   .. ..$ : int 25
##   .. ..$ : int 26
##   .. ..$ : int 27
##   .. ..$ : int 28
##   .. ..$ : int 29
##   .. ..$ : int 30
##   .. ..$ : int 31
##   .. ..$ : int 32
##   .. ..$ : int 33
##   .. ..$ : int 34
##   .. ..$ : int 35
##   .. ..$ : int 36
##   .. ..$ : int 37
##   .. ..$ : int 38
##   .. ..$ : int 39
##   .. ..$ : int 40
##   .. ..$ : int 41
##   .. ..$ : int 42
##   .. ..$ : int 43
##   .. ..$ : int 44
##   .. ..$ : int 45
##   .. ..$ : int 46
##   .. ..$ : int 47
##   .. ..$ : int 48
##   .. ..$ : int 49
##   .. ..$ : int 50
##   .. ..$ : int 51
##   .. ..$ : int 52
##   .. ..$ : int 53
##   .. ..$ : int 54
##   .. ..$ : int 55
##   .. ..$ : int 56
##   .. ..$ : int 57
##   .. ..$ : int 58
##   .. ..$ : int 59
##   .. ..$ : int 60
##   .. ..$ : int 61
##   .. ..$ : int 62
##   .. ..$ : int 63
##   .. ..$ : int 64
##   .. ..$ : int 65
##   .. ..$ : int 66
##   .. ..$ : int 67
##   .. ..$ : int 68
##   .. ..$ : int 69
##   .. ..$ : int 70
##   .. ..$ : int 71
##   .. ..$ : int 72
##   .. ..$ : int 73
##   .. ..$ : int 74
##   .. ..$ : int 75
##   .. ..$ : int 76
##   .. ..$ : int 77
##   .. ..$ : int 78
##   .. ..$ : int 79
##   .. ..$ : int 80
##   .. ..$ : int 81
##   .. ..$ : int 82
##   .. ..$ : int 83
##   .. ..$ : int 84
##   .. ..$ : int 85
##   .. ..$ : int 86
##   .. ..$ : int 87
##   .. ..$ : int 88
##   .. ..$ : int 89
##   .. ..$ : int 90
##   .. ..$ : int 91
##   .. ..$ : int 92
##   .. ..$ : int 93
##   .. ..$ : int 94
##   .. ..$ : int 95
##   .. ..$ : int 96
##   .. ..$ : int 97
##   .. ..$ : int 98
##   .. ..$ : int 99
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
data.r <- data %>%
  rename(document = id)
topic_doc_with_text <- merge(topic_doc_wider, data.r, by="document")
write.csv(topic_doc_with_text, "topic_doc_with_text.csv")  
  3. Plotting the topic prevalence
topic_prevalence <- topic_doc_wider %>%
  summarise_if(is.numeric,mean) %>%
  pivot_longer(cols = everything(), names_to = "topic", values_to = "prevalence")

ggplot(topic_prevalence, aes(x=reorder(topic, prevalence), prevalence)) +
  geom_col(show.legend = FALSE, color = "steelblue", fill = "grey", width = 0.1) +
  theme_classic() + 
  xlab("") +
  coord_flip()

STM

Package we will use

The stm package is the main one we will use to perform the analysis.

#install.packages("stm")
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

Preprocessing

  1. The steps for preprocessing we went through under LDA are applicable to STM as well. If you want more control over preprocessing, I recommend going through those steps.

  2. stm also provides two built-in functions:

  • textProcessor performs removal of punctuation, numbers, and stop words, as well as stemming

  • prepDocuments performs further manipulations including removing words based on frequency and renumbering word indices

data_stm_clean <- textProcessor(
  data$transcript, # documents to be processed
  metadata = data, # additional data about the documents; needs to be a data frame or matrix. This does not affect text processing, but ensures that if documents are dropped the corresponding metadata rows are dropped as well
  lowercase = TRUE, # default is TRUE
  removestopwords = TRUE, # default is TRUE
  removenumbers = TRUE, # default is TRUE
  removepunctuation = TRUE, # default is TRUE
  wordLengths = c(3, Inf)) # default is to retain words with at least 3 characters
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
str(data_stm_clean) 
## List of 4
##  $ documents   :List of 236
##   ..$ 1  : int [1:2, 1:544] 148 1 225 1 235 2 258 1 273 1 ...
##   ..$ 2  : int [1:2, 1:545] 87 1 196 1 243 1 313 1 322 2 ...
##   ..$ 3  : int [1:2, 1:669] 19 1 87 1 290 1 322 2 354 5 ...
##   ..$ 4  : int [1:2, 1:645] 258 1 294 1 314 1 322 1 335 2 ...
##   ..$ 5  : int [1:2, 1:575] 163 1 273 1 294 1 300 1 322 1 ...
##   ..$ 6  : int [1:2, 1:517] 87 1 88 1 94 1 98 1 148 2 ...
##   ..$ 7  : int [1:2, 1:501] 28 1 65 1 136 1 258 2 273 1 ...
##   ..$ 8  : int [1:2, 1:524] 148 1 258 2 302 2 314 1 322 1 ...
##   ..$ 9  : int [1:2, 1:572] 285 1 322 3 350 3 410 1 446 3 ...
##   ..$ 10 : int [1:2, 1:350] 322 1 369 2 446 4 507 2 523 6 ...
##   ..$ 11 : int [1:2, 1:459] 214 1 273 1 274 1 289 1 322 4 ...
##   ..$ 12 : int [1:2, 1:484] 107 1 243 1 273 3 284 2 289 3 ...
##   ..$ 13 : int [1:2, 1:572] 28 1 148 1 243 1 294 1 320 2 ...
##   ..$ 14 : int [1:2, 1:613] 148 1 168 1 273 2 287 1 322 1 ...
##   ..$ 15 : int [1:2, 1:595] 88 1 258 1 273 3 295 1 313 1 ...
##   ..$ 16 : int [1:2, 1:590] 37 1 41 1 224 1 235 1 237 1 ...
##   ..$ 17 : int [1:2, 1:552] 148 2 250 1 313 1 314 1 322 1 ...
##   ..$ 18 : int [1:2, 1:489] 148 1 152 1 310 1 322 5 350 8 ...
##   ..$ 19 : int [1:2, 1:687] 258 1 276 1 290 1 313 1 320 1 ...
##   ..$ 20 : int [1:2, 1:586] 23 1 51 1 76 1 87 1 149 1 ...
##   ..$ 21 : int [1:2, 1:594] 148 1 235 1 273 1 322 1 348 1 ...
##   ..$ 22 : int [1:2, 1:476] 1 1 98 2 146 1 273 1 321 1 ...
##   ..$ 23 : int [1:2, 1:515] 26 1 34 1 85 1 98 1 232 1 ...
##   ..$ 24 : int [1:2, 1:475] 254 1 258 1 322 2 331 1 463 1 ...
##   ..$ 25 : int [1:2, 1:559] 69 2 153 1 186 1 228 1 230 1 ...
##   ..$ 26 : int [1:2, 1:764] 273 1 289 1 313 1 314 1 316 2 ...
##   ..$ 27 : int [1:2, 1:759] 244 1 251 1 258 1 313 3 332 1 ...
##   ..$ 28 : int [1:2, 1:524] 2 1 3 1 70 1 273 1 322 2 ...
##   ..$ 29 : int [1:2, 1:576] 98 1 148 1 389 1 392 2 441 1 ...
##   ..$ 30 : int [1:2, 1:523] 148 1 273 1 276 1 314 1 322 3 ...
##   ..$ 31 : int [1:2, 1:518] 1 2 290 1 313 3 314 1 322 6 ...
##   ..$ 32 : int [1:2, 1:529] 61 2 273 1 286 1 322 2 355 1 ...
##   ..$ 33 : int [1:2, 1:484] 14 1 70 1 112 1 114 1 322 4 ...
##   ..$ 34 : int [1:2, 1:513] 273 1 292 1 320 1 322 1 334 1 ...
##   ..$ 35 : int [1:2, 1:568] 2 1 87 2 107 1 225 1 258 1 ...
##   ..$ 36 : int [1:2, 1:530] 15 4 133 3 148 2 159 1 258 1 ...
##   ..$ 37 : int [1:2, 1:565] 85 1 87 1 240 1 295 1 313 2 ...
##   ..$ 38 : int [1:2, 1:540] 100 1 155 1 284 1 290 1 320 2 ...
##   ..$ 39 : int [1:2, 1:568] 107 1 148 1 203 1 258 2 314 1 ...
##   ..$ 40 : int [1:2, 1:519] 273 2 290 1 295 1 313 1 314 1 ...
##   ..$ 41 : int [1:2, 1:512] 148 1 189 1 269 1 273 1 289 1 ...
##   ..$ 42 : int [1:2, 1:543] 63 1 259 1 273 1 309 1 320 1 ...
##   ..$ 43 : int [1:2, 1:456] 1 1 2 1 258 1 321 1 326 1 ...
##   ..$ 44 : int [1:2, 1:528] 314 1 322 6 365 1 380 1 382 1 ...
##   ..$ 45 : int [1:2, 1:523] 285 1 289 1 314 1 322 6 336 1 ...
##   ..$ 46 : int [1:2, 1:554] 20 1 77 1 195 2 320 1 322 3 ...
##   ..$ 47 : int [1:2, 1:507] 227 3 273 1 307 1 320 1 322 1 ...
##   ..$ 48 : int [1:2, 1:538] 234 1 282 1 285 1 322 5 332 2 ...
##   ..$ 49 : int [1:2, 1:487] 233 1 238 1 273 1 311 1 322 1 ...
##   ..$ 50 : int [1:2, 1:500] 215 2 258 1 296 2 322 4 368 1 ...
##   ..$ 51 : int [1:2, 1:481] 120 2 162 1 285 1 294 1 322 1 ...
##   ..$ 52 : int [1:2, 1:551] 278 1 289 1 322 4 365 1 369 2 ...
##   ..$ 53 : int [1:2, 1:595] 238 1 239 1 273 1 284 1 294 1 ...
##   ..$ 54 : int [1:2, 1:449] 276 1 320 1 322 5 355 1 358 1 ...
##   ..$ 55 : int [1:2, 1:589] 12 1 322 3 387 1 389 2 410 1 ...
##   ..$ 56 : int [1:2, 1:538] 245 1 252 1 273 1 289 1 322 5 ...
##   ..$ 57 : int [1:2, 1:560] 270 1 313 1 322 2 335 1 386 1 ...
##   ..$ 58 : int [1:2, 1:560] 314 2 320 1 322 1 364 1 414 2 ...
##   ..$ 59 : int [1:2, 1:483] 273 1 316 1 320 2 332 1 369 2 ...
##   ..$ 60 : int [1:2, 1:498] 252 1 289 1 294 1 322 5 331 2 ...
##   ..$ 61 : int [1:2, 1:481] 229 1 236 1 273 1 280 1 322 4 ...
##   ..$ 62 : int [1:2, 1:497] 258 1 273 1 294 1 322 1 363 1 ...
##   ..$ 63 : int [1:2, 1:495] 290 1 322 5 350 1 351 1 374 2 ...
##   ..$ 64 : int [1:2, 1:426] 258 1 322 4 361 1 369 1 410 2 ...
##   ..$ 65 : int [1:2, 1:499] 272 1 322 4 410 1 488 1 507 1 ...
##   ..$ 66 : int [1:2, 1:577] 30 1 122 1 123 1 257 1 280 1 ...
##   ..$ 67 : int [1:2, 1:521] 87 1 285 1 355 1 378 1 528 1 ...
##   ..$ 68 : int [1:2, 1:587] 29 1 257 1 258 1 273 1 289 1 ...
##   ..$ 69 : int [1:2, 1:591] 1 1 96 1 201 1 273 1 322 4 ...
##   ..$ 70 : int [1:2, 1:548] 268 1 322 6 339 1 374 1 378 1 ...
##   ..$ 71 : int [1:2, 1:547] 89 1 258 1 273 1 289 1 322 2 ...
##   ..$ 72 : int [1:2, 1:558] 87 3 252 1 288 1 322 6 380 1 ...
##   ..$ 73 : int [1:2, 1:516] 1 1 87 1 258 1 273 1 285 1 ...
##   ..$ 74 : int [1:2, 1:515] 98 1 258 1 273 1 314 1 322 4 ...
##   ..$ 75 : int [1:2, 1:567] 1 1 3 1 87 3 221 1 258 1 ...
##   ..$ 76 : int [1:2, 1:576] 2 1 76 1 87 1 203 1 273 2 ...
##   ..$ 77 : int [1:2, 1:552] 178 1 258 1 273 2 314 1 320 2 ...
##   ..$ 78 : int [1:2, 1:483] 232 1 252 1 293 1 314 1 322 2 ...
##   ..$ 79 : int [1:2, 1:548] 97 1 98 1 180 1 273 1 314 1 ...
##   ..$ 80 : int [1:2, 1:562] 88 1 273 2 285 2 313 1 378 1 ...
##   ..$ 81 : int [1:2, 1:549] 273 1 313 1 320 2 321 1 322 2 ...
##   ..$ 82 : int [1:2, 1:564] 1 1 88 1 116 1 156 1 222 1 ...
##   ..$ 83 : int [1:2, 1:648] 2 1 98 1 142 1 150 1 255 1 ...
##   ..$ 84 : int [1:2, 1:598] 87 1 98 1 150 1 173 1 260 1 ...
##   ..$ 85 : int [1:2, 1:544] 87 2 194 1 285 1 290 1 302 1 ...
##   ..$ 86 : int [1:2, 1:562] 39 1 70 1 87 2 284 2 290 2 ...
##   ..$ 87 : int [1:2, 1:574] 87 1 322 3 346 2 348 1 358 1 ...
##   ..$ 88 : int [1:2, 1:515] 87 3 88 1 107 1 320 2 322 1 ...
##   ..$ 89 : int [1:2, 1:579] 258 1 283 1 313 1 314 8 320 4 ...
##   ..$ 90 : int [1:2, 1:477] 1 2 87 2 96 1 128 1 322 2 ...
##   ..$ 91 : int [1:2, 1:570] 1 1 98 1 258 1 285 1 289 1 ...
##   ..$ 92 : int [1:2, 1:480] 87 1 208 1 252 1 285 1 322 1 ...
##   ..$ 93 : int [1:2, 1:552] 87 1 98 2 119 1 180 1 258 1 ...
##   ..$ 94 : int [1:2, 1:487] 4 1 87 1 180 1 258 3 320 1 ...
##   ..$ 95 : int [1:2, 1:596] 50 1 94 1 131 1 147 1 294 1 ...
##   ..$ 96 : int [1:2, 1:594] 177 2 273 3 300 1 313 1 314 3 ...
##   ..$ 97 : int [1:2, 1:581] 82 1 87 1 196 1 216 2 305 1 ...
##   ..$ 98 : int [1:2, 1:597] 87 1 98 1 181 1 258 1 284 1 ...
##   ..$ 99 : int [1:2, 1:543] 139 1 180 1 215 1 216 1 217 1 ...
##   .. [list output truncated]
##  $ vocab       : chr [1:14037] "---" "----" "-----" "------" ...
##  $ meta        :'data.frame':    236 obs. of  3 variables:
##   ..$ id        : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
##   ..$ season    : chr [1:236] "s10" "s10" "s10" "s10" ...
##   ..$ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
##  $ docs.removed: int(0) 
##  - attr(*, "class")= chr "textProcessor"
data_stm_processed <- prepDocuments(documents = data_stm_clean$documents, 
                     vocab = data_stm_clean$vocab, 
                     meta = data_stm_clean$meta)
## Removing 7663 of 14037 terms (7663 of 129661 tokens) due to frequency 
## Your corpus now has 236 documents, 6374 terms and 121998 tokens.
str(data_stm_processed)
## List of 7
##  $ documents     :List of 236
##   ..$ 1  : int [1:2, 1:509] 44 1 67 1 70 2 80 1 84 1 ...
##   ..$ 2  : int [1:2, 1:525] 24 1 53 1 75 1 106 1 112 2 ...
##   ..$ 3  : int [1:2, 1:609] 24 1 94 1 112 2 125 5 142 1 ...
##   ..$ 4  : int [1:2, 1:614] 80 1 97 1 107 1 112 1 118 2 ...
##   ..$ 5  : int [1:2, 1:487] 84 1 97 1 99 1 112 1 128 3 ...
##   ..$ 6  : int [1:2, 1:486] 24 1 25 1 30 1 33 1 44 2 ...
##   ..$ 7  : int [1:2, 1:473] 8 1 80 2 84 1 106 2 112 3 ...
##   ..$ 8  : int [1:2, 1:502] 44 1 80 2 101 2 107 1 112 1 ...
##   ..$ 9  : int [1:2, 1:543] 91 1 112 3 123 3 154 1 168 3 ...
##   ..$ 10 : int [1:2, 1:341] 112 1 137 2 168 4 191 2 200 6 ...
##   ..$ 11 : int [1:2, 1:437] 84 1 85 1 93 1 112 4 119 1 ...
##   ..$ 12 : int [1:2, 1:455] 35 1 75 1 84 3 90 2 93 3 ...
##   ..$ 13 : int [1:2, 1:542] 8 1 44 1 75 1 97 1 110 2 ...
##   ..$ 14 : int [1:2, 1:573] 44 1 84 2 112 1 122 1 123 2 ...
##   ..$ 15 : int [1:2, 1:545] 25 1 80 1 84 3 98 1 106 1 ...
##   ..$ 16 : int [1:2, 1:571] 70 1 72 1 84 3 91 1 107 1 ...
##   ..$ 17 : int [1:2, 1:519] 44 2 106 1 107 1 112 1 123 3 ...
##   ..$ 18 : int [1:2, 1:468] 44 1 105 1 112 5 123 8 130 1 ...
##   ..$ 19 : int [1:2, 1:649] 80 1 86 1 94 1 106 1 110 1 ...
##   ..$ 20 : int [1:2, 1:538] 7 1 20 1 24 1 56 1 75 1 ...
##   ..$ 21 : int [1:2, 1:545] 44 1 70 1 84 1 112 1 122 1 ...
##   ..$ 22 : int [1:2, 1:424] 1 1 33 2 84 1 111 1 112 1 ...
##   ..$ 23 : int [1:2, 1:466] 23 1 33 1 68 1 80 1 93 1 ...
##   ..$ 24 : int [1:2, 1:435] 80 1 112 2 114 1 176 1 191 1 ...
##   ..$ 25 : int [1:2, 1:509] 107 1 112 1 122 1 147 1 154 1 ...
##   ..$ 26 : int [1:2, 1:709] 84 1 93 1 106 1 107 1 108 2 ...
##   ..$ 27 : int [1:2, 1:687] 80 1 106 3 115 1 118 1 122 1 ...
##   ..$ 28 : int [1:2, 1:486] 2 1 3 1 19 1 84 1 112 2 ...
##   ..$ 29 : int [1:2, 1:526] 33 1 44 1 147 1 149 2 166 1 ...
##   ..$ 30 : int [1:2, 1:487] 44 1 84 1 86 1 107 1 112 3 ...
##   ..$ 31 : int [1:2, 1:476] 1 2 94 1 106 3 107 1 112 6 ...
##   ..$ 32 : int [1:2, 1:494] 84 1 92 1 112 2 126 1 137 1 ...
##   ..$ 33 : int [1:2, 1:444] 19 1 112 4 142 1 147 1 185 2 ...
##   ..$ 34 : int [1:2, 1:491] 84 1 95 1 110 1 112 1 117 1 ...
##   ..$ 35 : int [1:2, 1:533] 2 1 24 2 35 1 67 1 80 1 ...
##   ..$ 36 : int [1:2, 1:482] 44 2 80 1 91 3 98 2 106 1 ...
##   ..$ 37 : int [1:2, 1:522] 23 1 24 1 74 1 98 1 106 2 ...
##   ..$ 38 : int [1:2, 1:506] 90 1 94 1 110 2 112 2 142 1 ...
##   ..$ 39 : int [1:2, 1:536] 35 1 44 1 55 1 80 2 107 1 ...
##   ..$ 40 : int [1:2, 1:479] 84 2 94 1 98 1 106 1 107 1 ...
##   ..$ 41 : int [1:2, 1:477] 44 1 84 1 93 1 112 3 166 1 ...
##   ..$ 42 : int [1:2, 1:512] 17 1 81 1 84 1 104 1 110 1 ...
##   ..$ 43 : int [1:2, 1:431] 1 1 2 1 80 1 111 1 124 1 ...
##   ..$ 44 : int [1:2, 1:487] 107 1 112 6 135 1 143 1 144 1 ...
##   ..$ 45 : int [1:2, 1:496] 91 1 93 1 107 1 112 6 137 1 ...
##   ..$ 46 : int [1:2, 1:507] 21 1 110 1 112 3 118 1 121 1 ...
##   ..$ 47 : int [1:2, 1:473] 84 1 103 1 110 1 112 1 122 1 ...
##   ..$ 48 : int [1:2, 1:512] 69 1 91 1 112 5 115 2 147 1 ...
##   ..$ 49 : int [1:2, 1:455] 73 1 84 1 112 1 114 1 142 1 ...
##   ..$ 50 : int [1:2, 1:475] 61 2 80 1 112 4 136 1 147 1 ...
##   ..$ 51 : int [1:2, 1:446] 91 1 97 1 112 1 126 1 147 1 ...
##   ..$ 52 : int [1:2, 1:504] 93 1 112 4 135 1 137 2 156 1 ...
##   ..$ 53 : int [1:2, 1:540] 73 1 84 1 90 1 97 1 107 1 ...
##   ..$ 54 : int [1:2, 1:428] 86 1 110 1 112 5 126 1 128 1 ...
##   ..$ 55 : int [1:2, 1:543] 112 3 147 2 154 1 191 2 198 1 ...
##   ..$ 56 : int [1:2, 1:501] 76 1 78 1 84 1 93 1 112 5 ...
##   ..$ 57 : int [1:2, 1:514] 106 1 112 2 118 1 146 1 149 1 ...
##   ..$ 58 : int [1:2, 1:528] 107 2 110 1 112 1 134 1 156 2 ...
##   ..$ 59 : int [1:2, 1:456] 84 1 108 1 110 2 115 1 137 2 ...
##   ..$ 60 : int [1:2, 1:471] 78 1 93 1 97 1 112 5 114 2 ...
##   ..$ 61 : int [1:2, 1:465] 71 1 84 1 88 1 112 4 147 1 ...
##   ..$ 62 : int [1:2, 1:463] 80 1 84 1 97 1 112 1 133 1 ...
##   ..$ 63 : int [1:2, 1:455] 94 1 112 5 123 1 124 1 140 2 ...
##   ..$ 64 : int [1:2, 1:407] 80 1 112 4 131 1 137 1 154 2 ...
##   ..$ 65 : int [1:2, 1:461] 112 4 154 1 186 1 191 1 199 1 ...
##   ..$ 66 : int [1:2, 1:539] 79 1 88 1 112 2 156 3 172 1 ...
##   ..$ 67 : int [1:2, 1:482] 24 1 91 1 126 1 142 1 202 1 ...
##   ..$ 68 : int [1:2, 1:542] 79 1 80 1 84 1 93 1 106 1 ...
##   ..$ 69 : int [1:2, 1:551] 1 1 31 1 84 1 112 4 140 1 ...
##   ..$ 70 : int [1:2, 1:515] 112 6 140 1 142 1 144 1 156 1 ...
##   ..$ 71 : int [1:2, 1:527] 26 1 80 1 84 1 93 1 112 2 ...
##   ..$ 72 : int [1:2, 1:529] 24 3 78 1 112 6 143 1 147 2 ...
##   ..$ 73 : int [1:2, 1:493] 1 1 24 1 80 1 84 1 91 1 ...
##   ..$ 74 : int [1:2, 1:497] 33 1 80 1 84 1 107 1 112 4 ...
##   ..$ 75 : int [1:2, 1:534] 1 1 3 1 24 3 65 1 80 1 ...
##   ..$ 76 : int [1:2, 1:559] 2 1 20 1 24 1 55 1 84 2 ...
##   ..$ 77 : int [1:2, 1:523] 80 1 84 2 107 1 110 2 111 1 ...
##   ..$ 78 : int [1:2, 1:461] 68 1 78 1 96 1 107 1 112 2 ...
##   ..$ 79 : int [1:2, 1:523] 32 1 33 1 49 1 84 1 107 1 ...
##   ..$ 80 : int [1:2, 1:537] 25 1 84 2 91 2 106 1 142 1 ...
##   ..$ 81 : int [1:2, 1:516] 84 1 106 1 110 2 111 1 112 2 ...
##   ..$ 82 : int [1:2, 1:518] 1 1 25 1 66 1 80 1 81 2 ...
##   ..$ 83 : int [1:2, 1:606] 2 1 33 1 41 1 45 1 82 1 ...
##   ..$ 84 : int [1:2, 1:568] 24 1 33 1 45 1 48 1 82 1 ...
##   ..$ 85 : int [1:2, 1:522] 24 2 91 1 94 1 101 1 110 1 ...
##   ..$ 86 : int [1:2, 1:537] 19 1 24 2 90 2 94 2 107 4 ...
##   ..$ 87 : int [1:2, 1:530] 24 1 112 3 121 2 122 1 128 1 ...
##   ..$ 88 : int [1:2, 1:498] 24 3 25 1 35 1 110 2 112 1 ...
##   ..$ 89 : int [1:2, 1:546] 80 1 89 1 106 1 107 8 110 4 ...
##   ..$ 90 : int [1:2, 1:452] 1 2 24 2 31 1 112 2 154 1 ...
##   ..$ 91 : int [1:2, 1:533] 1 1 33 1 80 1 91 1 93 1 ...
##   ..$ 92 : int [1:2, 1:466] 24 1 57 1 78 1 91 1 112 1 ...
##   ..$ 93 : int [1:2, 1:513] 24 1 33 2 49 1 80 1 84 1 ...
##   ..$ 94 : int [1:2, 1:463] 4 1 24 1 49 1 80 3 110 1 ...
##   ..$ 95 : int [1:2, 1:563] 30 1 43 1 97 1 112 4 118 1 ...
##   ..$ 96 : int [1:2, 1:564] 84 3 99 1 106 1 107 3 110 2 ...
##   ..$ 97 : int [1:2, 1:541] 24 1 53 1 62 2 102 1 110 4 ...
##   ..$ 98 : int [1:2, 1:560] 24 1 33 1 80 1 90 1 110 1 ...
##   ..$ 99 : int [1:2, 1:512] 40 1 49 1 61 1 62 1 79 1 ...
##   .. [list output truncated]
##  $ vocab         : chr [1:6374] "---" "----" "-----" "------" ...
##  $ meta          :'data.frame':  236 obs. of  3 variables:
##   ..$ id        : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
##   ..$ season    : chr [1:236] "s10" "s10" "s10" "s10" ...
##   ..$ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
##  $ words.removed : chr [1:7663] "---crotch" "---go" "--alittl" "--anyway" ...
##  $ docs.removed  : NULL
##  $ tokens.removed: int 7663
##  $ wordcounts    : int [1:14037] 38 19 8 6 2 1 1 1 1 1 ...
#data_stm_processed$words.removed
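
If you want to drop more infrequent terms, prepDocuments accepts a lower.thresh argument, and stm's plotRemoved() shows how many documents, words, and tokens would be removed at different thresholds, which can help you pick a value. A sketch:

plotRemoved(data_stm_clean$documents, lower.thresh = seq(1, 50, by = 10))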

Set parameter (k)

The stm package has a function called searchK which allows the user to:

  • specify a range of values for k

  • run STM models for each value of k

  • and output multiple goodness-of-fit measures

The syntax of this function is very similar to that of the main stm function, except that the user specifies a range of k values as one of the arguments. Because searchK fits a separate model for each value of k, it can take a long time to run.

In today’s class, you can skip running this part and look at the output directly.

  1. Using the data preprocessed by stm built-in functions
stmcompare1 <- searchK(documents = data_stm_processed$documents, 
                      vocab = data_stm_processed$vocab, 
                      K = c(5:11),
                      prevalence = ~ season, # covariates predicting topic prevalence (document-topic)
                      # content = ~ season # covariates predicting topic content (topic-term)
                      data = data_stm_processed$meta, # data for covariates
                      init.type = "Spectral", # spectral initialization is recommended for stability
                      max.em.its = 75,
                      verbose = TRUE) # print messages
plot(stmcompare1)

We want to maximize the held-out likelihood, semantic coherence, and the lower bound of the marginal likelihood (an internal measure of fit), and minimize the residuals. As with LDA, there is no single best k value - we will use k = 6 in this tutorial.
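
The numbers behind the diagnostic plot are also stored in the searchK object if you want to inspect them directly:

stmcompare1$results # one row per K: held-out likelihood, residuals, semantic coherence, bound, etc.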

  2. Using data preprocessed by our custom-written code

Recall that earlier we preprocessed the data ourselves and generated a data object clean_data. We then used cast_dtm to generate a form of document-term matrix for LDA.

Besides the objects returned by the built-in functions textProcessor and prepDocuments, the stm package also accepts a document-term matrix, but in a specific (sparse) format. So below, we use cast_sparse to generate the form of document-term matrix that the stm package recognizes, along with the metadata the package will use.

data_custom_processed <- clean_data %>%
  cast_sparse(id, stem, n)

covariates <- clean_data %>% 
  distinct(id, season)
head(covariates)
## # A tibble: 6 × 2
## # Groups:   id [6]
##   id     season
##   <chr>  <chr> 
## 1 s4_e24 s4    
## 2 s6_e19 s6    
## 3 s4_e21 s4    
## 4 s7_e21 s7    
## 5 s6_e9  s6    
## 6 s9_e9  s9
stmcompare2 <- searchK(documents = data_custom_processed, # when using a matrix rather than a stm object, only documents need to be specified; vocab should be omitted
                      K = c(5:11),
                      N = floor(0.1 * nrow(covariates)), # need to set this when using a matrix. N specifies the number of documents to be held out as a test set; the default is floor(0.1 * length(documents)) for stm objects. Because we are not using stm objects, we substitute the number of documents with the number of rows in the covariates data (they are the same).
                      prevalence = ~ season, 
                      data = covariates, # covariate data frame we just constructed
                      init.type = "Spectral", 
                      max.em.its = 75,
                      verbose = TRUE) 
plot(stmcompare2)

Run STM

  1. Using the data preprocessed by stm built-in functions
friends_stm1 <- stm(documents = data_stm_processed$documents,
                    vocab = data_stm_processed$vocab,
                    K = 6, 
                    prevalence = ~ season, 
                    data = data_stm_processed$meta, 
                    init.type = "Spectral",
                    max.em.its = 75,
                    verbose = TRUE) 
  2. Using data preprocessed by our custom-written code
friends_stm2 <- stm(documents = data_custom_processed, 
                    K = 6, 
                    prevalence = ~ season, 
                    data = covariates,
                    init.type = "Spectral",
                    max.em.its = 75,
                    verbose = TRUE) 

STM output

  1. Topic-term

Quick glance at topic prevalence and keywords:

plot(friends_stm1)

plot(friends_stm2)

Different metrics:

See ??labelTopics for full documentation. Highest Prob lists the words within each topic with the highest probability (inferred directly from the topic-word distribution parameter). FREX lists words that are both frequent and exclusive, identifying words that distinguish topics.

labelTopics(friends_stm1)
## Topic 1 Top Words:
##       Highest Prob: yeah, just, okay, well, hey, dont, yknow 
##       FREX: kathi, joshua, mark, date, chef, ralph, stripper 
##       Lift: centerpiec, ahh-ahh, alessandro, allesandro, ambush, arcad, atmospher 
##       Score: snore, joshua, yknow, kathi, kim, joanna, allesandro 
## Topic 2 Top Words:
##       Highest Prob: okay, chandler, just, yeah, monica, dont, well 
##       FREX: ball, chip, bike, cat, thanksgiv, flashback, turkey 
##       Lift: ahh-ahhh, ahhhhh, bast, bedtim, bendi, bicep, blackjack 
##       Score: touchdown, bonni, thanksgiv, huddl, yemen, poker, jessica 
## Topic 3 Top Words:
##       Highest Prob: know, just, well, dont, yeah, hey, like 
##       FREX: mike, woah, uhm, emma, charli, eddi, whatr 
##       Lift: -caus, aaahh, aaahhh, aaahhhh, aahhh, academ, ad-lib 
##       Score: uhm, emma, duncan, woah, charli, eddi, alright 
## Topic 4 Top Words:
##       Highest Prob: ross, okay, rachel, just, yeah, well, joey 
##       FREX: emili, london, jill, joke, rosss, honeymoon, video 
##       Lift: --re, -anyway, -iw, -will, beef-tip, blargon, but 
##       Score: clip, emili, waltham, thee, chloe, jill, yknow 
## Topic 5 Top Words:
##       Highest Prob: okay, just, yeah, right, well, know, get 
##       FREX: car, paul, porsch, chair, boot, ursula, barri 
##       Lift: aaron, aruba, batman, breezi, knockin, oberman, omelet 
##       Score: breezi, paul, batman, porsch, tuxedo, semi-priv, yknow 
## Topic 6 Top Words:
##       Highest Prob: okay, just, know, yeah, dont, get, hey 
##       FREX: marcel, monkey, santa, christma, alright, erica, cmon 
##       Lift: heartbeat, part-jewish, ygo, armadillo, antacid, crier, dent 
##       Score: newark, marcel, erica, alright, tulsa, armadillo, emma
labelTopics(friends_stm2)
## Topic 1 Top Words:
##       Highest Prob: apart, love, plai, call, hand, tabl, cat 
##       FREX: paolo, woah, lotteri, eddi, ethan, poker, cat 
##       Lift: consuela, gucci, tenur, doesnt, mee, omnipot, goodacr 
##       Score: love, lotteri, erica, ethan, eddi, dont, nina 
## Topic 2 Top Words:
##       Highest Prob: babi, joei, ben, emma, talk, kid, parti 
##       FREX: babi, ben, emma, nurs, ami, bike, santa 
##       Lift: m'appel, hulk, eyr, tarantula, francett, telemarket, nomin 
##       Score: joei, babi, emma, talk, gavin, ami, je 
## Topic 3 Top Words:
##       Highest Prob: ross, chandler, monica, rachel, joei, emili, phoeb 
##       FREX: ross, monica, emili, chandler, laura, phoeb, london 
##       Lift: bagpip, lockhart, escort, barcelona, yemen, huddl, thee 
##       Score: ross, monica, chandler, rachel, joei, phoeb, emili 
## Topic 4 Top Words:
##       Highest Prob: gui, kiss, apart, parti, job, night, offic 
##       FREX: gui, angela, charli, sandi, mari, cassi, minsk 
##       Lift: hobart, thompson, gladi, comet, lennart, hilda, blowfish 
##       Score: gui, charli, cassi, sandi, gladi, platform, minsk 
## Topic 5 Top Words:
##       Highest Prob: rachel, date, move, marri, live, love, phoeb 
##       FREX: annul, date, kathi, move, perk, central, time 
##       Lift: sergei, russ, tobi, annul, thirtieth, 007, cujo 
##       Score: rachel, annul, mona, phoeb, love, sergei, mockol 
## Topic 6 Top Words:
##       Highest Prob: ring, joey', wait, chandler, god, gotta, marcel 
##       FREX: mindi, barri, paul, pete, ring, marcel, heckl 
##       Lift: mindi, gandolf, ronni, processor, heston, caitlin, gala 
##       Score: chandler, mindi, pete, marcel, wait, vinc, heckl
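
plot() on a fitted STM object supports a few other views besides the default summary plot; for example:

plot(friends_stm2, type = "labels") # top words for each topic printed as text
plot(friends_stm2, type = "hist")   # histograms of the document-topic proportions for each topic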

2. Document-Topic

findThoughts(friends_stm2, # model we fitted
             texts = data$id, # use texts = data$transcript for full transcripts
             n = 3, # three most relevant documents
             topics = 5) # for topic 5
## 
##  Topic 5: 
##       s1_e11
##      s5_e14
##      s3_e11

IMDB page for s1e11

IMDB page for s5e14

IMDB page for s3e11

3. Effects of covariates

season_effect <- estimateEffect(1:6 ~ season, 
                                friends_stm2, 
                                meta = covariates, 
                                uncertainty = "Global")
summary(season_effect, topics = 2)
## 
## Call:
## estimateEffect(formula = 1:6 ~ season, stmobj = friends_stm2, 
##     metadata = covariates, uncertainty = "Global")
## 
## 
## Topic 2:
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12202    0.07289   1.674   0.0955 .  
## seasons10    0.23850    0.11252   2.120   0.0351 *  
## seasons2     0.02267    0.10811   0.210   0.8341    
## seasons3    -0.10789    0.10022  -1.077   0.2828    
## seasons4     0.02764    0.10616   0.260   0.7948    
## seasons5     0.04048    0.10195   0.397   0.6917    
## seasons6    -0.07988    0.09959  -0.802   0.4233    
## seasons7     0.22058    0.11140   1.980   0.0489 *  
## seasons8     0.46942    0.10977   4.277 0.000028 ***
## seasons9     0.15165    0.10795   1.405   0.1615    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(season_effect, 
     covariate = "season", 
     topics = c(1:6),
     model = friends_stm2, 
     method = "difference",
     cov.value1 = "s10", # "treatment" group
     cov.value2 = "s1", # "control" group (baseline of comparison)
     xlab = "More likely to be in s1          More likely to be in s10",
     main = "Effect of s1 vs s10",
     labeltype = "custom",
     custom.labels = c("Topic 1", "Topic 2","Topic 3","Topic 4","Topic 5", "Topic 6"))
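
You can also plot the estimated prevalence of a single topic across all seasons, rather than contrasting two of them, using the "pointestimate" method; a sketch:

plot(season_effect,
     covariate = "season",
     topics = 2,
     model = friends_stm2,
     method = "pointestimate", # point estimate and interval for each level of season
     main = "Prevalence of topic 2 by season")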

Exercise

Conduct LDA topic modeling by engaging in the following steps:

  • Step 0: Download “beyonce.csv” to your folder. Open a new R script and save it in your folder. Set working directory to the folder. Use rm(list=ls()) to empty your environment.

  • Step 1: Tokenize the lyrics column. Inspect the word frequency

  • Step 2: Remove stop words (use your best judgment to construct a stop words dictionary)

  • Step 3: Stemming

  • Step 4: Use tf-idf for feature selection

  • Step 5: Create a document-term matrix

  • Step 6: Find an appropriate k value

  • Step 7: Define hyperparameters

  • Step 8: Run LDA

  • Step 9: What are the words under each topic?

  • Step 10: What are some of the representative songs reflecting each topic?