The main package that performs LDA is topicmodels. We will also use the ldatuning package to help choose the number of topics (k) for LDA.
#install.packages("topicmodels")
library(topicmodels)
#install.packages("ldatuning")
library(ldatuning)
Refresher: What is a corpus? What is a document? What is a term?
Refresher: The three matrices that the LDA algorithm works on to produce the results are
document-term matrix (the focus of the preprocessing)
document-topic matrix
topic-term matrix
In preprocessing, we need to:
tokenize the terms from the documents
clean the data and conduct feature selection (Question: What does feature selection mean?)
construct the document-term matrix
We’ve used tidytext before - it is the package we will use for tokenization and for constructing the document-term matrix.
We will also use dplyr and tidyr (both part of the tidyverse) to transform data frames.
We will use SnowballC to implement Porter’s word
stemming
library(tidytext)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#install.packages("SnowballC")
library(SnowballC)
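Here is a minimal sketch of the preprocessing steps listed above, applied to a toy two-document "corpus" (the documents below are made up purely for illustration):
# Toy corpus: two made-up one-sentence documents
toy <- data.frame(doc = c("d1", "d2"),
                  text = c("Ross and Rachel were on a break",
                           "Joey does not share food with his friends"))
toy %>%
  unnest_tokens(word, text) %>%           # tokenize: one row per word
  anti_join(stop_words, by = "word") %>%  # remove built-in stop words
  mutate(stem = wordStem(word)) %>%       # Porter stemming
  count(doc, stem) %>%                    # term counts per document
  cast_dtm(doc, stem, n)                  # cast into a document-term matrix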
We will use ggplot2 for visualization. It is also part of the tidyverse and is the most widely used visualization package in R. We can use it to plot all types of data, not just topic modeling results.
#install.packages("ggplot2")
library(ggplot2)
Finally, a trick to make life easier. Do you remember what the below code does?
options(scipen = 100)
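As a reminder, scipen sets a penalty against printing numbers in scientific notation; a quick illustration of its effect:
options(scipen = 0)   # the default: no penalty
format(0.000001)      # prints "1e-06"
options(scipen = 100) # a large penalty against scientific notation
format(0.000001)      # now prints "0.000001"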
Today we will use a dataset available in an R package, rather than using read.csv to load our own data. In tutorials and textbooks, you often see these “built-in” datasets used to demonstrate code, mainly for convenience and because everyone can access them.
The dataset we are using today contains the transcripts and metadata for the TV show Friends.
#install.packages("friends")
library(friends)
str(friends)
## tibble [67,373 × 6] (S3: tbl_df/tbl/data.frame)
## $ text : chr [1:67373] "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
## $ speaker : chr [1:67373] "Monica Geller" "Joey Tribbiani" "Chandler Bing" "Phoebe Buffay" ...
## $ season : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
## $ episode : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
## $ scene : int [1:67373] 1 1 1 1 1 1 1 1 1 1 ...
## $ utterance: int [1:67373] 1 2 3 4 5 6 7 8 9 10 ...
First, let’s combine lines from the same episode to get the data in the structure we want (we are not changing the actual content of the data in this step):
data <- friends %>%
mutate(season = paste("s", season, sep = ""),
id = paste(season, episode, sep = "_e")) %>% # mutate() is for creating, modifying, or deleting columns
group_by(id, season) %>% # group_by() function alone will not give any output; it should be followed by functions such as summarise() to perform an action by group
summarise(transcript = paste(text, collapse = ' '))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
head(data)
## # A tibble: 6 × 3
## # Groups: id [6]
## id season transcript
## <chr> <chr> <chr>
## 1 s10_e1 s10 "[Scene: Barbados, Monica and Chandler's Room. They both enter…
## 2 s10_e10 s10 "[Scene: Central Perk. Everybody's sitting on the couch and Mo…
## 3 s10_e11 s10 "[Scene: Central Perk. Everyone's sitting on the couch. Monica…
## 4 s10_e12 s10 "[Scene: Central Perk. Joey's sitting on the couch and Phoebe …
## 5 s10_e13 s10 "[Scene: Central Perk. Everyone's sitting on the couch and Pho…
## 6 s10_e14 s10 "[Scene: Chandler and Monica's apartment. They are having a di…
Now we begin preprocessing.
The code in the chunk below, which tokenizes the corpus and removes built-in stop words, is something we have done before:
data %>%
unnest_tokens(word, transcript) %>% # tokenize the corpus
anti_join(stop_words, by = "word") %>% # remove the stop words
count(word, sort = TRUE) # word frequency
## # A tibble: 107,113 × 3
## # Groups: id [236]
## id word n
## <chr> <chr> <int>
## 1 s4_e24 ross 68
## 2 s4_e21 ross 54
## 3 s7_e21 monica 52
## 4 s6_e9 ross 51
## 5 s9_e9 ross 51
## 6 s4_e20 yeah 49
## 7 s4_e24 emily 47
## 8 s6_e23 ring 47
## 9 s8_e19 joey 46
## 10 s1_e19 marcel 45
## # ℹ 107,103 more rows
Constructing a custom stop words list containing interjections and filler words:
custom_stop <- data.frame(
word =c("yeah","hey","gonna","scene","uh","y'know","umm","huh","wanna","wow","ah","um","ya","ooh","alright","ohh","bye","whoa","kinda","ha","uhh","ow","uhm","hmm","ahh","wh","em","ugh"))
Now we do proper cleaning:
clean_data <- data %>%
unnest_tokens(word, transcript) %>%
anti_join(stop_words, by = "word") %>%
anti_join(custom_stop, by = "word") %>% # further get rid of custom stop words
mutate(stem = wordStem(word)) %>% # stemming
count(id, stem) %>% # count word stems in each transcript
arrange(-n) %>% # order the data by count
mutate(season = gsub("(.*)_.*", id, replacement = "\\1")) %>% # reattach season (won't use it for LDA, will use for STM)
bind_tf_idf(id, stem, n) # compute tf-idf score
summary(clean_data$tf_idf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001024 0.027382 0.077582 0.449639 0.318593 4.037025
hist(clean_data$tf_idf)
clean_data <- clean_data %>%
filter(tf_idf > 0.027382) # keep only words with tf-idf above the first quartile (from the summary above)
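As an aside, instead of hard-coding the first-quartile value printed by summary(), you could compute the cutoff directly. The commented-out sketch below is an alternative to (not an addition to) the filter above:
# tfidf_cutoff <- quantile(clean_data$tf_idf, 0.25) # the first quartile of tf-idf
# clean_data <- clean_data %>% filter(tf_idf > tfidf_cutoff)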
dtm <- clean_data %>%
cast_dtm(id, stem, n) # finally, we can create a document-term matrix
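A quick sanity check on the size of the resulting matrix never hurts; with the current preprocessing there should be 236 documents (one per episode), while the number of terms depends on your filtering choices:
dim(dtm) # number of documents, number of terms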
Here we use several metrics to find the “appropriate” number of topics in a statistical sense. But as we said last week, this should be an iterative process: inspect the LDA results produced by several different k values and select the one that is most useful for answering your substantive question.
The ldatuning package provides four metrics developed in past research (Griffiths & Steyvers 2004, Cao et al. 2009, Arun et al. 2010, and Deveaud et al. 2014).
results <- FindTopicsNumber(
dtm, # document term matrix
topics = seq(from = 3, to = 20, by = 1), # range of topics
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), # four metrics
method = "Gibbs", # Gibbs is a sampling method
control = list(seed = 77), # set a seed for replicable results
verbose = TRUE # if FALSE (default), suppress all warnings and additional information
) # this should take a minute to run with the current data
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(results)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
## Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We can also check perplexity scores.
The perplexity function in topicmodels generates the perplexity score of a fitted LDA model.
We can estimate three LDA models for three different k values and generate a perplexity score for each of the three models.
We’ve used lapply(X, FUN) before: it applies a function (FUN) to each element of X.
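A trivial reminder of the pattern (in the call below, the extra arguments x = dtm and control are passed along to LDA(), so each element of k_values ends up as the k argument):
lapply(c(1, 4, 9), sqrt) # returns a list containing 1, 2, and 3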
k_values <- c(9, 10, 11)
lda_compare <- k_values %>%
lapply(x = dtm, LDA, control = list(seed = 1109)) # estimate three models
# this should take a minute to run with the current data
p <- tibble(k_values = k_values, perplex_scores = as.numeric(lapply(lda_compare, perplexity))) # estimate perplexity scores and build a data frame for plotting
p
## # A tibble: 3 × 2
## k_values perplex_scores
## <dbl> <dbl>
## 1 9 1896.
## 2 10 1826.
## 3 11 1765.
We can visualize this using ggplot (note that + rather than %>% connects the different layers of a ggplot):
ggplot(p, aes(k_values, perplex_scores)) +
geom_point() +
geom_line() +
labs(title = "Evaluating LDA topic models",
subtitle = "Optimal number of topics (smaller is better)",
x = "Number of topics",
y = "Perplexity")
We need to define the hyperparameter characterizing the prior document-topic distribution (alpha in topicmodels) and the hyperparameter characterizing the prior topic-term distribution (delta in topicmodels).
It is good practice to set these in a separate chunk of code before running the model. With k = 11, the suggested starting value of 50/k for alpha works out to roughly 4.55, which you will see echoed in the fitted model object below.
k <- 11
control_LDA_Gibbs <- list(
alpha = 50/k, # the starting value for alpha is 50/k suggested by Griffiths & Steyvers (2004)
delta = 0.1, # the starting value for the hyperparameter characterizing the prior topic-term distribution is 0.1 suggested by Griffiths & Steyvers (2004)
estimate.beta = TRUE, # estimate the posterior topic-term distribution
verbose = 500, # if 0 (default), no output is generated during model fitting; if a positive integer, progress is reported every verbose iterations
save = 0, # if 0 (default), no intermediate results are saved in files; if a positive integer, every save iterations intermediate results are saved
keep = 0, # if a positive integer, the log-likelihood is saved every keep iterations
prefix = tempfile(), # path indicating where to save the intermediate results
seed = 999, # random seed for reproducibility
nstart = 1, # number of repeated random starts
best = TRUE, # return only the best model
iter = 2000, # number of Gibbs iterations, by default 2000
burnin = 0, # number of Gibbs iterations discarded, by default 0
thin = 2000) # then every 2000th Gibbs iteration is returned, by default equal to iter
friends_lda <- LDA(dtm, k=k, method="Gibbs", control = control_LDA_Gibbs)
## K = 11; V = 12408; M = 236
## Sampling 2000 iterations!
## Iteration 500 ...
## Iteration 1000 ...
## Iteration 1500 ...
## Iteration 2000 ...
## Gibbs sampling completed!
str(friends_lda)
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:133734] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ alpha : num 4.55
## ..@ call : language LDA(x = dtm, k = k, method = "Gibbs", control = control_LDA_Gibbs)
## ..@ Dim : int [1:2] 236 12408
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## .. .. ..@ delta : num 0.1
## .. .. ..@ iter : int 2000
## .. .. ..@ thin : int 2000
## .. .. ..@ burnin : int 0
## .. .. ..@ initialize : chr "random"
## .. .. ..@ alpha : num 4.55
## .. .. ..@ seed : int 999
## .. .. ..@ verbose : int 500
## .. .. ..@ prefix : chr "/var/folders/97/dfqcl8wx5gq6_mg2xfh_jwkc0000gn/T//RtmpuXI4UN/file1146b114a88bd"
## .. .. ..@ save : int 0
## .. .. ..@ nstart : int 1
## .. .. ..@ best : logi TRUE
## .. .. ..@ keep : int 0
## .. .. ..@ estimate.beta: logi TRUE
## ..@ k : int 11
## ..@ terms : chr [1:12408] "ross" "gui" "monica" "ring" ...
## ..@ documents : chr [1:236] "s4_e24" "s6_e19" "s4_e21" "s7_e21" ...
## ..@ beta : num [1:11, 1:12408] -3.16 -11.82 -11.7 -11.85 -11.84 ...
## ..@ gamma : num [1:236, 1:11] 0.314 0.116 0.54 0.685 0.619 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:70671] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ j : int [1:70671] 1 3 4 5 8 15 16 17 18 19 ...
## .. ..$ v : num [1:70671] 1 1 9 9 1 1 1 9 7 1 ...
## .. ..$ nrow: int 236
## .. ..$ ncol: int 12408
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -955778
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 133734
Quick glance at topic-term distribution:
terms(friends_lda, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "ross" "night" "gui" "talk" "game" "date" "carol"
## [2,] "chandler" "hand" "parti" "friend" "move" "offic" "marcel"
## [3,] "joei" "monei" "ben" "happen" "mom" "chandler'" "monkei"
## [4,] "rachel" "plai" "birthdai" "stuff" "throw" "enter" "la"
## [5,] "monica" "richard" "peopl" "sit" "joey'" "fire" "cat"
## [6,] "cut" "break" "funni" "couch" "ball" "dai" "juli"
## [7,] "door" "woman" "kiss" "kiss" "smoke" "book" "susan"
## [8,] "phoeb" "bedroom" "christma" "watch" "leav" "class" "life"
## [9,] "start" "laugh" "fun" "lot" "plai" "read" "woman"
## [10,] "move" "start" "dad" "perk" "car" "restaur" "barri"
## Topic 8 Topic 9 Topic 10 Topic 11
## [1,] "babi" "marri" "apart" "call"
## [2,] "phoeb" "wed" "emma" "rachel'"
## [3,] "doctor" "emili" "mike" "god"
## [4,] "honei" "ring" "love" "joey'"
## [5,] "fine" "dress" "monica'" "phone"
## [6,] "sex" "love" "feel" "plai"
## [7,] "dr" "wait" "miss" "hous"
## [8,] "father" "walk" "charli" "chair"
## [9,] "feel" "plan" "guess" "line"
## [10,] "massag" "wear" "word" "mark"
Visualize the topic-term distribution (labeled “beta” in
topicmodels):
term_topic_matrix <- tidy(friends_lda, matrix="beta")
head(term_topic_matrix)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 ross 0.0423
## 2 2 ross 0.00000734
## 3 3 ross 0.00000828
## 4 4 ross 0.00000716
## 5 5 ross 0.00000718
## 6 6 ross 0.00000691
top_terms <- term_topic_matrix %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
head(top_terms)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 ross 0.0423
## 2 1 chandler 0.0398
## 3 1 joei 0.0343
## 4 1 rachel 0.0340
## 5 1 monica 0.0276
## 6 1 cut 0.0208
ggplot(top_terms, aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
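One optional refinement: by default, ggplot orders the terms alphabetically within each facet rather than by beta. tidytext provides reorder_within() and scale_x_reordered() for per-facet ordering; a sketch, assuming a reasonably recent version of tidytext:
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>% # order terms by beta within each topic
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() + # strip the within-topic suffix from the axis labels
  coord_flip()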
The document-topic distribution is labeled “gamma” in
topicmodels:
topic_doc_matrix <- tidy(friends_lda, matrix="gamma")
topic_doc_wider <- topic_doc_matrix %>%
mutate(topic = paste("topic", topic, sep = "")) %>%
pivot_wider(names_from = topic, values_from = gamma)
#View(topic_doc_wider)
See the documents with the highest proportion of topic 2:
topic_doc_wider %>%
arrange(-topic2)
## # A tibble: 236 × 12
## document topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 s5_e23 0.0708 0.584 0.00882 0.0236 0.0371 0.0304 0.0183 0.0115 0.153
## 2 s5_e24 0.0155 0.527 0.0345 0.0184 0.0697 0.00960 0.0169 0.0140 0.206
## 3 s1_e5 0.0293 0.509 0.0143 0.0594 0.0461 0.00927 0.0979 0.0611 0.0829
## 4 s7_e15 0.0173 0.493 0.0109 0.0173 0.0288 0.0619 0.0224 0.126 0.0696
## 5 s6_e17 0.0418 0.491 0.0276 0.0528 0.0702 0.116 0.0229 0.0355 0.0639
## 6 s3_e24 0.0482 0.476 0.0315 0.117 0.0594 0.0790 0.0231 0.0119 0.0399
## 7 s3_e23 0.0364 0.470 0.0154 0.0186 0.0348 0.136 0.0912 0.0412 0.0444
## 8 s3_e15 0.0444 0.454 0.0202 0.0751 0.0541 0.0751 0.0235 0.0202 0.0299
## 9 s3_e13 0.0239 0.400 0.0292 0.114 0.0609 0.206 0.0415 0.0292 0.0274
## 10 s5_e22 0.253 0.396 0.0256 0.0244 0.0993 0.0394 0.00817 0.0406 0.0319
## # ℹ 226 more rows
## # ℹ 2 more variables: topic10 <dbl>, topic11 <dbl>
The top two are s5_e23 and s5_e24, the two-part season 5 finale:
IMDB page for episode The One in Vegas, Part 1
IMDB page for episode The One in Vegas, Part 2
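You might also want each episode’s single most prevalent topic. A minimal sketch using the long-format gamma tibble (slice_max() requires dplyr 1.0 or later; with older versions, top_n() works similarly):
dominant_topic <- topic_doc_matrix %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>% # keep the row with the largest gamma for each episode
  ungroup()
head(dominant_topic)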
You can also re-attach the transcripts and investigate the data outside of R:
str(data)
## gropd_df [236 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ id : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
## $ season : chr [1:236] "s10" "s10" "s10" "s10" ...
## $ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
## - attr(*, "groups")= tibble [236 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ id : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
## ..$ .rows: list<int> [1:236]
## .. ..$ : int 1
## .. ..$ : int 2
## .. ..$ : int 3
## .. .. [list output truncated]
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
data.r <- data %>%
rename(document = id)
topic_doc_with_text <- merge(topic_doc_wider, data.r, by="document")
write.csv(topic_doc_with_text, "topic_doc_with_text.csv")
topic_prevalence <- topic_doc_wider %>%
summarise_if(is.numeric,mean) %>%
pivot_longer(cols = everything(), names_to = "topic", values_to = "prevalence")
ggplot(topic_prevalence, aes(x=reorder(topic, prevalence), prevalence)) +
geom_col(show.legend = FALSE, color = "steelblue", fill = "grey", width = 0.1) +
theme_classic() +
xlab("") +
coord_flip()
The stm package is the main package that performs structural topic modeling (STM).
#install.packages("stm")
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
The steps for preprocessing we went through under LDA are applicable for STM as well. If you want more control over preprocessing, I recommend that you go through those steps.
stm also provides two built-in preprocessing functions:
textProcessor removes punctuation, numbers, and stop words, and performs stemming
prepDocuments performs further manipulations, including removing words based on frequency and renumbering word indices
data_stm_clean <- textProcessor(
data$transcript, # documents to be processed
metadata = data, # additional data about the documents; needs to be a data frame or matrix. This does not affect text processing, but ensures that if documents are dropped the corresponding metadata rows are dropped as well
lowercase = TRUE, # default is TRUE
removestopwords = TRUE, # default is TRUE
removenumbers = TRUE, # default is TRUE
removepunctuation = TRUE, # default is TRUE
wordLengths = c(3, Inf)) # default is c(3, Inf): retain only words with at least 3 characters
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
str(data_stm_clean)
## List of 4
## $ documents :List of 236
## ..$ 1 : int [1:2, 1:544] 148 1 225 1 235 2 258 1 273 1 ...
## ..$ 2 : int [1:2, 1:545] 87 1 196 1 243 1 313 1 322 2 ...
## ..$ 3 : int [1:2, 1:669] 19 1 87 1 290 1 322 2 354 5 ...
## ..$ 4 : int [1:2, 1:645] 258 1 294 1 314 1 322 1 335 2 ...
## ..$ 5 : int [1:2, 1:575] 163 1 273 1 294 1 300 1 322 1 ...
## ..$ 6 : int [1:2, 1:517] 87 1 88 1 94 1 98 1 148 2 ...
## ..$ 7 : int [1:2, 1:501] 28 1 65 1 136 1 258 2 273 1 ...
## ..$ 8 : int [1:2, 1:524] 148 1 258 2 302 2 314 1 322 1 ...
## ..$ 9 : int [1:2, 1:572] 285 1 322 3 350 3 410 1 446 3 ...
## ..$ 10 : int [1:2, 1:350] 322 1 369 2 446 4 507 2 523 6 ...
## ..$ 11 : int [1:2, 1:459] 214 1 273 1 274 1 289 1 322 4 ...
## ..$ 12 : int [1:2, 1:484] 107 1 243 1 273 3 284 2 289 3 ...
## ..$ 13 : int [1:2, 1:572] 28 1 148 1 243 1 294 1 320 2 ...
## ..$ 14 : int [1:2, 1:613] 148 1 168 1 273 2 287 1 322 1 ...
## ..$ 15 : int [1:2, 1:595] 88 1 258 1 273 3 295 1 313 1 ...
## ..$ 16 : int [1:2, 1:590] 37 1 41 1 224 1 235 1 237 1 ...
## ..$ 17 : int [1:2, 1:552] 148 2 250 1 313 1 314 1 322 1 ...
## ..$ 18 : int [1:2, 1:489] 148 1 152 1 310 1 322 5 350 8 ...
## ..$ 19 : int [1:2, 1:687] 258 1 276 1 290 1 313 1 320 1 ...
## ..$ 20 : int [1:2, 1:586] 23 1 51 1 76 1 87 1 149 1 ...
## ..$ 21 : int [1:2, 1:594] 148 1 235 1 273 1 322 1 348 1 ...
## ..$ 22 : int [1:2, 1:476] 1 1 98 2 146 1 273 1 321 1 ...
## ..$ 23 : int [1:2, 1:515] 26 1 34 1 85 1 98 1 232 1 ...
## ..$ 24 : int [1:2, 1:475] 254 1 258 1 322 2 331 1 463 1 ...
## ..$ 25 : int [1:2, 1:559] 69 2 153 1 186 1 228 1 230 1 ...
## ..$ 26 : int [1:2, 1:764] 273 1 289 1 313 1 314 1 316 2 ...
## ..$ 27 : int [1:2, 1:759] 244 1 251 1 258 1 313 3 332 1 ...
## ..$ 28 : int [1:2, 1:524] 2 1 3 1 70 1 273 1 322 2 ...
## ..$ 29 : int [1:2, 1:576] 98 1 148 1 389 1 392 2 441 1 ...
## ..$ 30 : int [1:2, 1:523] 148 1 273 1 276 1 314 1 322 3 ...
## ..$ 31 : int [1:2, 1:518] 1 2 290 1 313 3 314 1 322 6 ...
## ..$ 32 : int [1:2, 1:529] 61 2 273 1 286 1 322 2 355 1 ...
## ..$ 33 : int [1:2, 1:484] 14 1 70 1 112 1 114 1 322 4 ...
## ..$ 34 : int [1:2, 1:513] 273 1 292 1 320 1 322 1 334 1 ...
## ..$ 35 : int [1:2, 1:568] 2 1 87 2 107 1 225 1 258 1 ...
## ..$ 36 : int [1:2, 1:530] 15 4 133 3 148 2 159 1 258 1 ...
## ..$ 37 : int [1:2, 1:565] 85 1 87 1 240 1 295 1 313 2 ...
## ..$ 38 : int [1:2, 1:540] 100 1 155 1 284 1 290 1 320 2 ...
## ..$ 39 : int [1:2, 1:568] 107 1 148 1 203 1 258 2 314 1 ...
## ..$ 40 : int [1:2, 1:519] 273 2 290 1 295 1 313 1 314 1 ...
## ..$ 41 : int [1:2, 1:512] 148 1 189 1 269 1 273 1 289 1 ...
## ..$ 42 : int [1:2, 1:543] 63 1 259 1 273 1 309 1 320 1 ...
## ..$ 43 : int [1:2, 1:456] 1 1 2 1 258 1 321 1 326 1 ...
## ..$ 44 : int [1:2, 1:528] 314 1 322 6 365 1 380 1 382 1 ...
## ..$ 45 : int [1:2, 1:523] 285 1 289 1 314 1 322 6 336 1 ...
## ..$ 46 : int [1:2, 1:554] 20 1 77 1 195 2 320 1 322 3 ...
## ..$ 47 : int [1:2, 1:507] 227 3 273 1 307 1 320 1 322 1 ...
## ..$ 48 : int [1:2, 1:538] 234 1 282 1 285 1 322 5 332 2 ...
## ..$ 49 : int [1:2, 1:487] 233 1 238 1 273 1 311 1 322 1 ...
## ..$ 50 : int [1:2, 1:500] 215 2 258 1 296 2 322 4 368 1 ...
## ..$ 51 : int [1:2, 1:481] 120 2 162 1 285 1 294 1 322 1 ...
## ..$ 52 : int [1:2, 1:551] 278 1 289 1 322 4 365 1 369 2 ...
## ..$ 53 : int [1:2, 1:595] 238 1 239 1 273 1 284 1 294 1 ...
## ..$ 54 : int [1:2, 1:449] 276 1 320 1 322 5 355 1 358 1 ...
## ..$ 55 : int [1:2, 1:589] 12 1 322 3 387 1 389 2 410 1 ...
## ..$ 56 : int [1:2, 1:538] 245 1 252 1 273 1 289 1 322 5 ...
## ..$ 57 : int [1:2, 1:560] 270 1 313 1 322 2 335 1 386 1 ...
## ..$ 58 : int [1:2, 1:560] 314 2 320 1 322 1 364 1 414 2 ...
## ..$ 59 : int [1:2, 1:483] 273 1 316 1 320 2 332 1 369 2 ...
## ..$ 60 : int [1:2, 1:498] 252 1 289 1 294 1 322 5 331 2 ...
## ..$ 61 : int [1:2, 1:481] 229 1 236 1 273 1 280 1 322 4 ...
## ..$ 62 : int [1:2, 1:497] 258 1 273 1 294 1 322 1 363 1 ...
## ..$ 63 : int [1:2, 1:495] 290 1 322 5 350 1 351 1 374 2 ...
## ..$ 64 : int [1:2, 1:426] 258 1 322 4 361 1 369 1 410 2 ...
## ..$ 65 : int [1:2, 1:499] 272 1 322 4 410 1 488 1 507 1 ...
## ..$ 66 : int [1:2, 1:577] 30 1 122 1 123 1 257 1 280 1 ...
## ..$ 67 : int [1:2, 1:521] 87 1 285 1 355 1 378 1 528 1 ...
## ..$ 68 : int [1:2, 1:587] 29 1 257 1 258 1 273 1 289 1 ...
## ..$ 69 : int [1:2, 1:591] 1 1 96 1 201 1 273 1 322 4 ...
## ..$ 70 : int [1:2, 1:548] 268 1 322 6 339 1 374 1 378 1 ...
## ..$ 71 : int [1:2, 1:547] 89 1 258 1 273 1 289 1 322 2 ...
## ..$ 72 : int [1:2, 1:558] 87 3 252 1 288 1 322 6 380 1 ...
## ..$ 73 : int [1:2, 1:516] 1 1 87 1 258 1 273 1 285 1 ...
## ..$ 74 : int [1:2, 1:515] 98 1 258 1 273 1 314 1 322 4 ...
## ..$ 75 : int [1:2, 1:567] 1 1 3 1 87 3 221 1 258 1 ...
## ..$ 76 : int [1:2, 1:576] 2 1 76 1 87 1 203 1 273 2 ...
## ..$ 77 : int [1:2, 1:552] 178 1 258 1 273 2 314 1 320 2 ...
## ..$ 78 : int [1:2, 1:483] 232 1 252 1 293 1 314 1 322 2 ...
## ..$ 79 : int [1:2, 1:548] 97 1 98 1 180 1 273 1 314 1 ...
## ..$ 80 : int [1:2, 1:562] 88 1 273 2 285 2 313 1 378 1 ...
## ..$ 81 : int [1:2, 1:549] 273 1 313 1 320 2 321 1 322 2 ...
## ..$ 82 : int [1:2, 1:564] 1 1 88 1 116 1 156 1 222 1 ...
## ..$ 83 : int [1:2, 1:648] 2 1 98 1 142 1 150 1 255 1 ...
## ..$ 84 : int [1:2, 1:598] 87 1 98 1 150 1 173 1 260 1 ...
## ..$ 85 : int [1:2, 1:544] 87 2 194 1 285 1 290 1 302 1 ...
## ..$ 86 : int [1:2, 1:562] 39 1 70 1 87 2 284 2 290 2 ...
## ..$ 87 : int [1:2, 1:574] 87 1 322 3 346 2 348 1 358 1 ...
## ..$ 88 : int [1:2, 1:515] 87 3 88 1 107 1 320 2 322 1 ...
## ..$ 89 : int [1:2, 1:579] 258 1 283 1 313 1 314 8 320 4 ...
## ..$ 90 : int [1:2, 1:477] 1 2 87 2 96 1 128 1 322 2 ...
## ..$ 91 : int [1:2, 1:570] 1 1 98 1 258 1 285 1 289 1 ...
## ..$ 92 : int [1:2, 1:480] 87 1 208 1 252 1 285 1 322 1 ...
## ..$ 93 : int [1:2, 1:552] 87 1 98 2 119 1 180 1 258 1 ...
## ..$ 94 : int [1:2, 1:487] 4 1 87 1 180 1 258 3 320 1 ...
## ..$ 95 : int [1:2, 1:596] 50 1 94 1 131 1 147 1 294 1 ...
## ..$ 96 : int [1:2, 1:594] 177 2 273 3 300 1 313 1 314 3 ...
## ..$ 97 : int [1:2, 1:581] 82 1 87 1 196 1 216 2 305 1 ...
## ..$ 98 : int [1:2, 1:597] 87 1 98 1 181 1 258 1 284 1 ...
## ..$ 99 : int [1:2, 1:543] 139 1 180 1 215 1 216 1 217 1 ...
## .. [list output truncated]
## $ vocab : chr [1:14037] "---" "----" "-----" "------" ...
## $ meta :'data.frame': 236 obs. of 3 variables:
## ..$ id : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
## ..$ season : chr [1:236] "s10" "s10" "s10" "s10" ...
## ..$ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
## $ docs.removed: int(0)
## - attr(*, "class")= chr "textProcessor"
data_stm_processed <- prepDocuments(documents = data_stm_clean$documents,
vocab = data_stm_clean$vocab,
meta = data_stm_clean$meta)
## Removing 7663 of 14037 terms (7663 of 129661 tokens) due to frequency
## Your corpus now has 236 documents, 6374 terms and 121998 tokens.
str(data_stm_processed)
## List of 7
## $ documents :List of 236
## ..$ 1 : int [1:2, 1:509] 44 1 67 1 70 2 80 1 84 1 ...
## ..$ 2 : int [1:2, 1:525] 24 1 53 1 75 1 106 1 112 2 ...
## ..$ 3 : int [1:2, 1:609] 24 1 94 1 112 2 125 5 142 1 ...
## ..$ 4 : int [1:2, 1:614] 80 1 97 1 107 1 112 1 118 2 ...
## ..$ 5 : int [1:2, 1:487] 84 1 97 1 99 1 112 1 128 3 ...
## ..$ 6 : int [1:2, 1:486] 24 1 25 1 30 1 33 1 44 2 ...
## ..$ 7 : int [1:2, 1:473] 8 1 80 2 84 1 106 2 112 3 ...
## ..$ 8 : int [1:2, 1:502] 44 1 80 2 101 2 107 1 112 1 ...
## ..$ 9 : int [1:2, 1:543] 91 1 112 3 123 3 154 1 168 3 ...
## ..$ 10 : int [1:2, 1:341] 112 1 137 2 168 4 191 2 200 6 ...
## ..$ 11 : int [1:2, 1:437] 84 1 85 1 93 1 112 4 119 1 ...
## ..$ 12 : int [1:2, 1:455] 35 1 75 1 84 3 90 2 93 3 ...
## ..$ 13 : int [1:2, 1:542] 8 1 44 1 75 1 97 1 110 2 ...
## ..$ 14 : int [1:2, 1:573] 44 1 84 2 112 1 122 1 123 2 ...
## ..$ 15 : int [1:2, 1:545] 25 1 80 1 84 3 98 1 106 1 ...
## ..$ 16 : int [1:2, 1:571] 70 1 72 1 84 3 91 1 107 1 ...
## ..$ 17 : int [1:2, 1:519] 44 2 106 1 107 1 112 1 123 3 ...
## ..$ 18 : int [1:2, 1:468] 44 1 105 1 112 5 123 8 130 1 ...
## ..$ 19 : int [1:2, 1:649] 80 1 86 1 94 1 106 1 110 1 ...
## ..$ 20 : int [1:2, 1:538] 7 1 20 1 24 1 56 1 75 1 ...
## ..$ 21 : int [1:2, 1:545] 44 1 70 1 84 1 112 1 122 1 ...
## ..$ 22 : int [1:2, 1:424] 1 1 33 2 84 1 111 1 112 1 ...
## ..$ 23 : int [1:2, 1:466] 23 1 33 1 68 1 80 1 93 1 ...
## ..$ 24 : int [1:2, 1:435] 80 1 112 2 114 1 176 1 191 1 ...
## ..$ 25 : int [1:2, 1:509] 107 1 112 1 122 1 147 1 154 1 ...
## ..$ 26 : int [1:2, 1:709] 84 1 93 1 106 1 107 1 108 2 ...
## ..$ 27 : int [1:2, 1:687] 80 1 106 3 115 1 118 1 122 1 ...
## ..$ 28 : int [1:2, 1:486] 2 1 3 1 19 1 84 1 112 2 ...
## ..$ 29 : int [1:2, 1:526] 33 1 44 1 147 1 149 2 166 1 ...
## ..$ 30 : int [1:2, 1:487] 44 1 84 1 86 1 107 1 112 3 ...
## ..$ 31 : int [1:2, 1:476] 1 2 94 1 106 3 107 1 112 6 ...
## ..$ 32 : int [1:2, 1:494] 84 1 92 1 112 2 126 1 137 1 ...
## ..$ 33 : int [1:2, 1:444] 19 1 112 4 142 1 147 1 185 2 ...
## ..$ 34 : int [1:2, 1:491] 84 1 95 1 110 1 112 1 117 1 ...
## ..$ 35 : int [1:2, 1:533] 2 1 24 2 35 1 67 1 80 1 ...
## ..$ 36 : int [1:2, 1:482] 44 2 80 1 91 3 98 2 106 1 ...
## ..$ 37 : int [1:2, 1:522] 23 1 24 1 74 1 98 1 106 2 ...
## ..$ 38 : int [1:2, 1:506] 90 1 94 1 110 2 112 2 142 1 ...
## ..$ 39 : int [1:2, 1:536] 35 1 44 1 55 1 80 2 107 1 ...
## ..$ 40 : int [1:2, 1:479] 84 2 94 1 98 1 106 1 107 1 ...
## ..$ 41 : int [1:2, 1:477] 44 1 84 1 93 1 112 3 166 1 ...
## ..$ 42 : int [1:2, 1:512] 17 1 81 1 84 1 104 1 110 1 ...
## ..$ 43 : int [1:2, 1:431] 1 1 2 1 80 1 111 1 124 1 ...
## ..$ 44 : int [1:2, 1:487] 107 1 112 6 135 1 143 1 144 1 ...
## ..$ 45 : int [1:2, 1:496] 91 1 93 1 107 1 112 6 137 1 ...
## ..$ 46 : int [1:2, 1:507] 21 1 110 1 112 3 118 1 121 1 ...
## ..$ 47 : int [1:2, 1:473] 84 1 103 1 110 1 112 1 122 1 ...
## ..$ 48 : int [1:2, 1:512] 69 1 91 1 112 5 115 2 147 1 ...
## ..$ 49 : int [1:2, 1:455] 73 1 84 1 112 1 114 1 142 1 ...
## ..$ 50 : int [1:2, 1:475] 61 2 80 1 112 4 136 1 147 1 ...
## ..$ 51 : int [1:2, 1:446] 91 1 97 1 112 1 126 1 147 1 ...
## ..$ 52 : int [1:2, 1:504] 93 1 112 4 135 1 137 2 156 1 ...
## ..$ 53 : int [1:2, 1:540] 73 1 84 1 90 1 97 1 107 1 ...
## ..$ 54 : int [1:2, 1:428] 86 1 110 1 112 5 126 1 128 1 ...
## ..$ 55 : int [1:2, 1:543] 112 3 147 2 154 1 191 2 198 1 ...
## ..$ 56 : int [1:2, 1:501] 76 1 78 1 84 1 93 1 112 5 ...
## ..$ 57 : int [1:2, 1:514] 106 1 112 2 118 1 146 1 149 1 ...
## ..$ 58 : int [1:2, 1:528] 107 2 110 1 112 1 134 1 156 2 ...
## ..$ 59 : int [1:2, 1:456] 84 1 108 1 110 2 115 1 137 2 ...
## ..$ 60 : int [1:2, 1:471] 78 1 93 1 97 1 112 5 114 2 ...
## ..$ 61 : int [1:2, 1:465] 71 1 84 1 88 1 112 4 147 1 ...
## ..$ 62 : int [1:2, 1:463] 80 1 84 1 97 1 112 1 133 1 ...
## ..$ 63 : int [1:2, 1:455] 94 1 112 5 123 1 124 1 140 2 ...
## ..$ 64 : int [1:2, 1:407] 80 1 112 4 131 1 137 1 154 2 ...
## ..$ 65 : int [1:2, 1:461] 112 4 154 1 186 1 191 1 199 1 ...
## ..$ 66 : int [1:2, 1:539] 79 1 88 1 112 2 156 3 172 1 ...
## ..$ 67 : int [1:2, 1:482] 24 1 91 1 126 1 142 1 202 1 ...
## ..$ 68 : int [1:2, 1:542] 79 1 80 1 84 1 93 1 106 1 ...
## ..$ 69 : int [1:2, 1:551] 1 1 31 1 84 1 112 4 140 1 ...
## ..$ 70 : int [1:2, 1:515] 112 6 140 1 142 1 144 1 156 1 ...
## ..$ 71 : int [1:2, 1:527] 26 1 80 1 84 1 93 1 112 2 ...
## ..$ 72 : int [1:2, 1:529] 24 3 78 1 112 6 143 1 147 2 ...
## ..$ 73 : int [1:2, 1:493] 1 1 24 1 80 1 84 1 91 1 ...
## ..$ 74 : int [1:2, 1:497] 33 1 80 1 84 1 107 1 112 4 ...
## ..$ 75 : int [1:2, 1:534] 1 1 3 1 24 3 65 1 80 1 ...
## ..$ 76 : int [1:2, 1:559] 2 1 20 1 24 1 55 1 84 2 ...
## ..$ 77 : int [1:2, 1:523] 80 1 84 2 107 1 110 2 111 1 ...
## ..$ 78 : int [1:2, 1:461] 68 1 78 1 96 1 107 1 112 2 ...
## ..$ 79 : int [1:2, 1:523] 32 1 33 1 49 1 84 1 107 1 ...
## ..$ 80 : int [1:2, 1:537] 25 1 84 2 91 2 106 1 142 1 ...
## ..$ 81 : int [1:2, 1:516] 84 1 106 1 110 2 111 1 112 2 ...
## ..$ 82 : int [1:2, 1:518] 1 1 25 1 66 1 80 1 81 2 ...
## ..$ 83 : int [1:2, 1:606] 2 1 33 1 41 1 45 1 82 1 ...
## ..$ 84 : int [1:2, 1:568] 24 1 33 1 45 1 48 1 82 1 ...
## ..$ 85 : int [1:2, 1:522] 24 2 91 1 94 1 101 1 110 1 ...
## ..$ 86 : int [1:2, 1:537] 19 1 24 2 90 2 94 2 107 4 ...
## ..$ 87 : int [1:2, 1:530] 24 1 112 3 121 2 122 1 128 1 ...
## ..$ 88 : int [1:2, 1:498] 24 3 25 1 35 1 110 2 112 1 ...
## ..$ 89 : int [1:2, 1:546] 80 1 89 1 106 1 107 8 110 4 ...
## ..$ 90 : int [1:2, 1:452] 1 2 24 2 31 1 112 2 154 1 ...
## ..$ 91 : int [1:2, 1:533] 1 1 33 1 80 1 91 1 93 1 ...
## ..$ 92 : int [1:2, 1:466] 24 1 57 1 78 1 91 1 112 1 ...
## ..$ 93 : int [1:2, 1:513] 24 1 33 2 49 1 80 1 84 1 ...
## ..$ 94 : int [1:2, 1:463] 4 1 24 1 49 1 80 3 110 1 ...
## ..$ 95 : int [1:2, 1:563] 30 1 43 1 97 1 112 4 118 1 ...
## ..$ 96 : int [1:2, 1:564] 84 3 99 1 106 1 107 3 110 2 ...
## ..$ 97 : int [1:2, 1:541] 24 1 53 1 62 2 102 1 110 4 ...
## ..$ 98 : int [1:2, 1:560] 24 1 33 1 80 1 90 1 110 1 ...
## ..$ 99 : int [1:2, 1:512] 40 1 49 1 61 1 62 1 79 1 ...
## .. [list output truncated]
## $ vocab : chr [1:6374] "---" "----" "-----" "------" ...
## $ meta :'data.frame': 236 obs. of 3 variables:
## ..$ id : chr [1:236] "s10_e1" "s10_e10" "s10_e11" "s10_e12" ...
## ..$ season : chr [1:236] "s10" "s10" "s10" "s10" ...
## ..$ transcript: chr [1:236] "[Scene: Barbados, Monica and Chandler's Room. They both enter from Ross's room. Monica still has her big, frizz"| __truncated__ "[Scene: Central Perk. Everybody's sitting on the couch and Monica is eating a chunk of cake.] Mmh... this cake "| __truncated__ "[Scene: Central Perk. Everyone's sitting on the couch. Monica and Joey enter.] Hey guys! Hey, let me tell them!"| __truncated__ "[Scene: Central Perk. Joey's sitting on the couch and Phoebe enters.] Oh, hey Joey. Uh, hey. Listen, I need to "| __truncated__ ...
## $ words.removed : chr [1:7663] "---crotch" "---go" "--alittl" "--anyway" ...
## $ docs.removed : NULL
## $ tokens.removed: int 7663
## $ wordcounts : int [1:14037] 38 19 8 6 2 1 1 1 1 1 ...
#data_stm_processed$words.removed
The stm package has a function called searchK which allows the user to:
specify a range of values for k
run an STM model for each value of k
output multiple goodness-of-fit measures
The syntax of this function is very similar to that of the main stm function, except that the user specifies a range for k as one of the arguments. Because it fits a separate model for every value of k, searchK can take a long time to run.
In today’s class, you can skip running this part and look at the output directly.
stmcompare1 <- searchK(documents = data_stm_processed$documents,
vocab = data_stm_processed$vocab,
K = c(5:11),
prevalence = ~ season, # covariates predicting topic prevalence (document-topic)
# content = ~ season # covariates predicting topic content (topic-term)
data = data_stm_processed$meta, # data for covariates
init.type = "Spectral", # spectral initialization is recommended for stability
max.em.its = 75,
verbose = TRUE) # print messages
plot(stmcompare1)
We want to maximize the held-out likelihood, semantic coherence, and the lower bound of the marginal likelihood (an internal measure of fit), and to minimize residuals. As with LDA, there is no single best k value - we will use k = 6 in this tutorial.
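Besides the plot, the searchK object stores the underlying numbers in a results element (a data frame with one row per k), which can be easier to compare directly:
stmcompare1$results # held-out likelihood, residuals, semantic coherence, etc. for each k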
Recall that earlier we preprocessed the data ourselves and generated
a data object clean_data. We then used
cast_dtm to generate a form of document-term matrix for
LDA.
Besides the objects returned by the built-in functions textProcessor and prepDocuments, the stm package also accepts a document-term matrix, but in a specific sparse format. So below, we use cast_sparse to generate the form of document-term matrix that the stm package recognizes, along with the metadata the package will use.
data_custom_processed <- clean_data %>%
cast_sparse(id, stem, n)
covariates <- clean_data %>%
distinct(id, season)
head(covariates)
## # A tibble: 6 × 2
## # Groups: id [6]
## id season
## <chr> <chr>
## 1 s4_e24 s4
## 2 s6_e19 s6
## 3 s4_e21 s4
## 4 s7_e21 s7
## 5 s6_e9 s6
## 6 s9_e9 s9
stmcompare2 <- searchK(documents = data_custom_processed, # when using a matrix rather than a stm object, only documents need to be specified; vocab should be omitted
K = c(5:11),
N = floor(0.1 * nrow(covariates)), # need to set this when using a matrix. N specifies the number of documents to be held-out as testing set, with the default as floor(0.1 * length(documents)) for stm objects. Because we are not using stm objects, we substitute # of documents with # of rows in the covariates data (they are the same).
prevalence = ~ season,
data = covariates, # covariate data frame we just constructed
init.type = "Spectral",
max.em.its = 75,
verbose = TRUE)
plot(stmcompare2)
friends_stm1 <- stm(documents = data_stm_processed$documents,
vocab = data_stm_processed$vocab,
K = 6,
prevalence = ~ season,
data = data_stm_processed$meta,
init.type = "Spectral",
max.em.its = 75,
verbose = TRUE)
friends_stm2 <- stm(documents = data_custom_processed,
K = 6,
prevalence = ~ season,
data = covariates,
init.type = "Spectral",
max.em.its = 75,
verbose = TRUE)
Quick glance at topic prevalence and keywords:
plot(friends_stm1)
plot(friends_stm2)
Different metrics:
see ??labelTopics for full documentation. Highest Prob
represents words within each topic with the highest probability
(inferred directly from topic-word distribution paramete. FREX
represents words that are both frequent and exclusive, identifying words
that distinguish topics.
labelTopics(friends_stm1)
## Topic 1 Top Words:
## Highest Prob: yeah, just, okay, well, hey, dont, yknow
## FREX: kathi, joshua, mark, date, chef, ralph, stripper
## Lift: centerpiec, ahh-ahh, alessandro, allesandro, ambush, arcad, atmospher
## Score: snore, joshua, yknow, kathi, kim, joanna, allesandro
## Topic 2 Top Words:
## Highest Prob: okay, chandler, just, yeah, monica, dont, well
## FREX: ball, chip, bike, cat, thanksgiv, flashback, turkey
## Lift: ahh-ahhh, ahhhhh, bast, bedtim, bendi, bicep, blackjack
## Score: touchdown, bonni, thanksgiv, huddl, yemen, poker, jessica
## Topic 3 Top Words:
## Highest Prob: know, just, well, dont, yeah, hey, like
## FREX: mike, woah, uhm, emma, charli, eddi, whatr
## Lift: -caus, aaahh, aaahhh, aaahhhh, aahhh, academ, ad-lib
## Score: uhm, emma, duncan, woah, charli, eddi, alright
## Topic 4 Top Words:
## Highest Prob: ross, okay, rachel, just, yeah, well, joey
## FREX: emili, london, jill, joke, rosss, honeymoon, video
## Lift: --re, -anyway, -iw, -will, beef-tip, blargon, but
## Score: clip, emili, waltham, thee, chloe, jill, yknow
## Topic 5 Top Words:
## Highest Prob: okay, just, yeah, right, well, know, get
## FREX: car, paul, porsch, chair, boot, ursula, barri
## Lift: aaron, aruba, batman, breezi, knockin, oberman, omelet
## Score: breezi, paul, batman, porsch, tuxedo, semi-priv, yknow
## Topic 6 Top Words:
## Highest Prob: okay, just, know, yeah, dont, get, hey
## FREX: marcel, monkey, santa, christma, alright, erica, cmon
## Lift: heartbeat, part-jewish, ygo, armadillo, antacid, crier, dent
## Score: newark, marcel, erica, alright, tulsa, armadillo, emma
labelTopics(friends_stm2)
## Topic 1 Top Words:
## Highest Prob: apart, love, plai, call, hand, tabl, cat
## FREX: paolo, woah, lotteri, eddi, ethan, poker, cat
## Lift: consuela, gucci, tenur, doesnt, mee, omnipot, goodacr
## Score: love, lotteri, erica, ethan, eddi, dont, nina
## Topic 2 Top Words:
## Highest Prob: babi, joei, ben, emma, talk, kid, parti
## FREX: babi, ben, emma, nurs, ami, bike, santa
## Lift: m'appel, hulk, eyr, tarantula, francett, telemarket, nomin
## Score: joei, babi, emma, talk, gavin, ami, je
## Topic 3 Top Words:
## Highest Prob: ross, chandler, monica, rachel, joei, emili, phoeb
## FREX: ross, monica, emili, chandler, laura, phoeb, london
## Lift: bagpip, lockhart, escort, barcelona, yemen, huddl, thee
## Score: ross, monica, chandler, rachel, joei, phoeb, emili
## Topic 4 Top Words:
## Highest Prob: gui, kiss, apart, parti, job, night, offic
## FREX: gui, angela, charli, sandi, mari, cassi, minsk
## Lift: hobart, thompson, gladi, comet, lennart, hilda, blowfish
## Score: gui, charli, cassi, sandi, gladi, platform, minsk
## Topic 5 Top Words:
## Highest Prob: rachel, date, move, marri, live, love, phoeb
## FREX: annul, date, kathi, move, perk, central, time
## Lift: sergei, russ, tobi, annul, thirtieth, 007, cujo
## Score: rachel, annul, mona, phoeb, love, sergei, mockol
## Topic 6 Top Words:
## Highest Prob: ring, joey', wait, chandler, god, gotta, marcel
## FREX: mindi, barri, paul, pete, ring, marcel, heckl
## Lift: mindi, gandolf, ronni, processor, heston, caitlin, gala
## Score: chandler, mindi, pete, marcel, wait, vinc, heckl
2. Document-Topic
findThoughts(friends_stm2, # model we fitted
texts = data$id, # use texts = data$transcript to see the full transcripts
n = 3, # three most relevant documents
topics = 5) # for topic 5
##
## Topic 5:
## s1_e11
## s5_e14
## s3_e11
IMDB page for s1e11
IMDB page for s5e14
IMDB page for s3e11
3. Effects of covariates
season_effect <- estimateEffect(1:6 ~ season,
friends_stm2,
meta = covariates,
uncertainty = "Global")
summary(season_effect, topics = 2)
##
## Call:
## estimateEffect(formula = 1:6 ~ season, stmobj = friends_stm2,
## metadata = covariates, uncertainty = "Global")
##
##
## Topic 2:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12202 0.07289 1.674 0.0955 .
## seasons10 0.23850 0.11252 2.120 0.0351 *
## seasons2 0.02267 0.10811 0.210 0.8341
## seasons3 -0.10789 0.10022 -1.077 0.2828
## seasons4 0.02764 0.10616 0.260 0.7948
## seasons5 0.04048 0.10195 0.397 0.6917
## seasons6 -0.07988 0.09959 -0.802 0.4233
## seasons7 0.22058 0.11140 1.980 0.0489 *
## seasons8 0.46942 0.10977 4.277 0.000028 ***
## seasons9 0.15165 0.10795 1.405 0.1615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(season_effect,
covariate = "season",
topics = c(1:6),
model = friends_stm2,
method = "difference",
cov.value1 = "s10", # "treatment" group
cov.value2 = "s1", # "control" group (baseline of comparison)
xlab = "More likely to be in s1 More likely to be in s10",
main = "Effect of s1 vs s10",
labeltype = "custom",
custom.labels = c("Topic 1", "Topic 2","Topic 3","Topic 4","Topic 5", "Topic 6"))
Conduct LDA topic modeling by engaging in the following steps:
Step 0: Download “beyonce.csv” to your folder. Open a new R script and save it in your folder. Set the working directory to the folder. Use rm(list=ls()) to empty your environment.
Step 1: Tokenize the lyrics column and inspect the word frequencies
Step 2: Remove stop words (use your best judgment to construct a stop words dictionary)
Step 3: Stemming
Step 4: Use tf-idf for feature selection
Step 5: Create a document-term matrix
Step 6: Find an appropriate k value
Step 7: Define hyperparameters
Step 8: Run LDA
Step 9: What are the words under each topic?
Step 10: What are some of the representative songs reflecting each topic?