Summary of the methods and objectives. This code is adapted from: tidytext website ch 4, ch 8, humanities data in R book, stm package paper, and quanteda package. Insert real citations here, or at end. Sources link/cite.
Load text data, then create separate tables: document text corpus and document metadata. This example is comprised of separate documents within the whole doc, marked with corresponding headers. One table has each doc as a continuous text block; the other table has each line as a separate entry. Adjust for corpus.
The first section of code should first remove all of the interviewer text: have only the caregiver text remaining.
source_text <- "C:/Users/newsomevw/OneDrive - National Institutes of Health/Desktop/TM scripts/interview_copilot_sample.txt"
original_interview <- read_lines(source_text, locale = locale(encoding = "UTF-8"))
original_interview <- original_interview[original_interview != ""]
interview_df <- tibble(text = original_interview) %>%
mutate(excerpt = ifelse(str_detect(text, "^EXCERPT"), text, NA), .before = 1) %>%
fill(excerpt, .direction = "down") %>%
filter(!str_detect(text, "^EXCERPT")) %>%
extract(text, into = c("speaker", "text"), regex = "^(SW|CG):\\s*(.*)") %>%
mutate(text = gsub("([0-9]{2})-([0-9]{3})", "\\1\\2", text)) %>%
mutate(line_id = row_number(), .before = 1)
# incorporate: %>% filter(speaker != "SW")
interview_text <- interview_df %>%
select(line_id, text)
interview_whole_df <- interview_df %>%
select(-line_id) %>%
group_by(excerpt, speaker) %>%
summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
mutate(doc_id = row_number(), .before = 1)
interview_whole_text <- interview_whole_df %>%
select(doc_id, text)
interview_df <- interview_df %>%
select(-text)
interview_whole_df <- interview_whole_df %>%
select(-text)
Metadata can be loaded separately, then create one, unified interview whole text table. Basic survey questions are included and may be explored.
source_metadata <- "C:/Users/newsomevw/OneDrive - National Institutes of Health/Desktop/TM scripts/interview_copilot_csv.csv"
meta_df <- read_csv(source_metadata) %>%
mutate(id = row_number(), .before = 1)
interview_whole_df <- interview_whole_df %>%
mutate(excerpt = as.integer(str_remove(excerpt, "EXCERPT ")))
interview_whole_df <- interview_whole_df %>%
filter(speaker == "CG") %>%
inner_join(meta_df, by = c("excerpt" = "id")) %>%
bind_rows(interview_whole_df %>%
filter(speaker == "SW")) %>%
arrange(doc_id)
interview_whole_df %>%
mutate(across(where(is.character), as.factor)) %>%
summary() %>%
kable(format = "simple")
| doc_id | excerpt | speaker | Gender | Age | Race | MaritalStatus | PsychDistress | RelationalSatisfaction | |
|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | Min. : 1.00 | CG:15 | F : 9 | Min. :33.0 | B : 7 | M : 6 | N : 8 | N : 8 | |
| 1st Qu.: 8.25 | 1st Qu.: 4.25 | SW:15 | M : 6 | 1st Qu.:38.5 | W : 8 | NM : 9 | Y : 7 | Y : 7 | |
| Median :15.50 | Median : 8.00 | NA | NA’s:15 | Median :42.0 | NA’s:15 | NA’s:15 | NA’s:15 | NA’s:15 | |
| Mean :15.50 | Mean : 8.00 | NA | NA | Mean :42.8 | NA | NA | NA | NA | |
| 3rd Qu.:22.75 | 3rd Qu.:11.75 | NA | NA | 3rd Qu.:46.5 | NA | NA | NA | NA | |
| Max. :30.00 | Max. :15.00 | NA | NA | Max. :53.0 | NA | NA | NA | NA | |
| NA | NA | NA | NA | NA’s :15 | NA | NA | NA | NA |
If necessary, and the text needs to be converted to tabular format for LIWC or other dictionary-based methods, then convert at or before this line.
Use the unnest_tokens function from tidytext to create a token table. Load the library of stop words, then confirm that none of the stop words are useful measures. Plot the top word frequency, observing that stop words are less useful to the corpus dict.
data(stop_words)
tidy_interview_token_withstop <- interview_whole_text %>%
unnest_tokens(word, text)
tidy_interview_token <- tidy_interview_token_withstop %>%
anti_join(stop_words)
# other process step: to interact with the words, limit stop words, or handle dictionary-specific words
# check this point, if the corpus objects make sense for the analysis and the outputs are reliable
tidy_interview_token_withstop %>%
count(word, sort = TRUE) %>%
kable(format = "simple")
| word | n |
|---|---|
| i | 320 |
| the | 236 |
| and | 232 |
| that | 224 |
| to | 220 |
| it | 193 |
| a | 145 |
| was | 141 |
| they | 140 |
| me | 137 |
tidy_interview_token %>%
count(word, sort = TRUE) %>%
kable(format = "simple")
| word | n |
|---|---|
| didn’t | 70 |
| feel | 68 |
| they’d | 57 |
| support | 50 |
| i’d | 40 |
| helped | 32 |
| time | 31 |
| yeah | 31 |
| emotional | 28 |
| it’s | 24 |
These tables show how some of the pronouns and relational words are included in the stop words, and could have meaning for the corpus. The below graph is of the basic word frequency, which is relatively uninformative without proper context.
tidy_interview_token %>%
count(word, sort = TRUE) %>%
filter(n > 15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
tidy_interview_line_token <- interview_text %>%
unnest_tokens(word, text)
tidy_interview_line_token <- tidy_interview_line_token %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
Tokenize the separate line table for later functions, to see if the line containing the text is important for modeling.
Calculate term frequency-inverse document frequency for each word, relative to the document. Use the bind function from tidytext. (In this sample, even-numbered doc_id code for SW text, and odd-numbered doc_id code for CG text. This will be removed.)
interview_doc_words <- tidy_interview_token %>%
count(doc_id, word, sort = TRUE)
total_doc_words <- interview_doc_words %>%
group_by(doc_id) %>%
summarize(total = sum(n))
interview_doc_words <- left_join(interview_doc_words, total_doc_words)
interview_doc_tf_idf <- interview_doc_words %>%
bind_tf_idf(word, doc_id, n)
interview_doc_tf_idf %>%
arrange(desc(tf_idf)) %>%
kable(format = "simple")
| doc_id | word | n | total | tf | idf | tf_idf |
|---|---|---|---|---|---|---|
| 26 | communication | 4 | 57 | 0.0701754 | 2.708050 | 0.1900386 |
| 28 | boundaries | 3 | 49 | 0.0612245 | 2.708050 | 0.1657990 |
| 17 | 47001 | 7 | 151 | 0.0463576 | 3.401197 | 0.1576714 |
| 14 | spring | 2 | 45 | 0.0444444 | 3.401197 | 0.1511643 |
| 10 | summer | 2 | 49 | 0.0408163 | 3.401197 | 0.1388244 |
| 28 | boundary | 2 | 49 | 0.0408163 | 3.401197 | 0.1388244 |
| 12 | winter | 2 | 42 | 0.0476190 | 2.708050 | 0.1289548 |
| 21 | 63001 | 7 | 157 | 0.0445860 | 2.708050 | 0.1207411 |
| 30 | partnership | 3 | 59 | 0.0508475 | 2.302585 | 0.1170806 |
| 27 | 94001 | 5 | 149 | 0.0335570 | 3.401197 | 0.1141341 |
Basic sentiments are described for the words in the corpus. This can be misinterpreted, given some of these words are negated.
interview_sentiments <- interview_doc_words %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(word, sentiment, sort = TRUE)
interview_sentiments %>%
group_by(sentiment) %>%
slice_max(n, n = 15) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Create a table of word bigrams, to investigate their TF-IDF. Remove stop words from bigrams prior to calculating TF-IDF; this requires separation, filtering, and uniting the bigrams. Adjust n to observe different N-grams, as shown in the examples for n = 2, n = 3.
tidy_interview_bigrams <- interview_whole_text %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
tidy_interview_bigrams %>%
count(bigram, sort = TRUE) %>%
kable(format = "simple")
| bigram | n |
|---|---|
| it was | 45 |
| i was | 43 |
| i didn’t | 34 |
| made me | 32 |
| it made | 30 |
| like i | 29 |
| me feel | 29 |
| i felt | 28 |
| and then | 26 |
| felt like | 24 |
bigrams_separated <- tidy_interview_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
interview_bigram_tf_idf <- bigrams_united %>%
count(doc_id, bigram) %>%
bind_tf_idf(doc_id, bigram, n) %>%
arrange(desc(tf_idf))
interview_bigram_tf_idf %>%
kable(format = "simple")
| doc_id | bigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| 12 | blend emotional | 1 | 1 | 4.329536 | 4.329536 |
| 12 | difference that’s | 1 | 1 | 4.329536 | 4.329536 |
| 12 | dismissing seriousness | 1 | 1 | 4.329536 | 4.329536 |
| 12 | emotional relief | 1 | 1 | 4.329536 | 4.329536 |
| 12 | nice blend | 1 | 1 | 4.329536 | 4.329536 |
| 12 | stretch presence | 1 | 1 | 4.329536 | 4.329536 |
| 12 | uncertainty i’d | 1 | 1 | 4.329536 | 4.329536 |
| 10 | contagion added | 1 | 1 | 4.162482 | 4.162482 |
| 10 | describing late | 1 | 1 | 4.162482 | 4.162482 |
| 10 | emotional contagion | 1 | 1 | 4.162482 | 4.162482 |
tidy_interview_trigrams <- interview_whole_text %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram))
tidy_interview_trigrams %>%
count(trigram, sort = TRUE) %>%
kable(format = "simple")
| trigram | n |
|---|---|
| it made me | 25 |
| made me feel | 23 |
| me feel like | 16 |
| is there anything | 15 |
| like i was | 15 |
| there anything else | 15 |
| feel like i | 12 |
| that feels important | 12 |
| feels important to | 11 |
| i didn’t want | 11 |
trigrams_separated <- tidy_interview_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word)
trigrams_counts <- trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
trigrams_united <- trigrams_filtered %>%
unite(trigram, word1, word2, word3, sep = " ")
interview_trigram_tf_idf <- trigrams_united %>%
count(doc_id, trigram) %>%
bind_tf_idf(doc_id, trigram, n) %>%
arrange(desc(tf_idf))
interview_trigram_tf_idf %>%
kable(format = "simple")
| doc_id | trigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| 27 | 94002 whoa surprisingly | 1 | 1 | 4.791650 | 4.791650 |
| 27 | feel respected completely | 1 | 1 | 4.791650 | 4.791650 |
| 5 | documenting symptoms rest | 1 | 1 | 4.386185 | 4.386185 |
| 5 | i’d start worrying | 1 | 1 | 4.386185 | 4.386185 |
| 5 | symptoms rest wasn’t | 1 | 1 | 4.386185 | 4.386185 |
| 12 | blend emotional relief | 1 | 1 | 4.386185 | 4.386185 |
| 12 | huge difference that’s | 1 | 1 | 4.386185 | 4.386185 |
| 12 | nice blend emotional | 1 | 1 | 4.386185 | 4.386185 |
| 26 | delegating harder that’s | 1 | 1 | 4.386185 | 4.386185 |
| 26 | human presence mattered | 1 | 1 | 4.386185 | 4.386185 |
Additional analysis on the impact of not and broader negation words (not, no, never, without). A different sentiment lexicon is used, for a numerical sentiment score. Also note that the colors plotted correspond to the original sentiment- the sentiment is opposite, with negation.
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE) %>%
kable(format = "simple")
| word1 | word2 | n |
|---|---|---|
| not | just | 5 |
| not | a | 3 |
| not | to | 3 |
| not | because | 2 |
| not | really | 2 |
| not | actually | 1 |
| not | around | 1 |
| not | at | 1 |
| not | before | 1 |
| not | consistently | 1 |
tidy_interview_not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
tidy_interview_not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")
negation_words <- c("not", "no", "never", "without")
bigrams_separated %>%
filter(word1 %in% negation_words) %>%
count(word1, word2, sort = TRUE) %>%
kable(format = "simple")
| word1 | word2 | n |
|---|---|---|
| no | one | 5 |
| not | just | 5 |
| without | me | 5 |
| not | a | 3 |
| not | to | 3 |
| no | fuss | 2 |
| no | guilt | 2 |
| not | because | 2 |
| not | really | 2 |
| never | asked | 1 |
tidy_interview_negation_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
tidy_interview_negation_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by a negation word")
Plot a network of bigrams, weighted by occurrence frequency. Then show the difference between a directed and undirected network, for the corpus. Adjust for corpus. This graph would be improved with pre-processing to highlight bigrams with dictionary-specific words, relational words, or highest-frequency words.
bigram_graph <- bigrams_counts %>%
filter(n > 2) %>%
graph_from_data_frame()
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Use the clean NLP package word features, lemmatization, and related word dictionaries. For NLP annotation, first load the right dict, then automatically compute the word characters. Highlight the most used nouns across docs and most used adjectives, also grouped by doc. Check how much we-language is used compared to total language.
cnlp_init_udpipe("english")
invisible(tidy_interview_nlp <- cnlp_annotate(
interview_whole_text, doc_name = "doc_id", text_name = "text")$token)
## Processed document 10 of 30
## Processed document 20 of 30
## Processed document 30 of 30
tidy_interview_nlp <- tidy_interview_nlp %>%
filter(upos != "PUNCT")
# select(tidy_interview_nlp, token, xpos, feats, tid_source, relation)
tidy_interview_nlp %>%
kable(format = "simple")
| doc_id | sid | tid | token | token_with_ws | lemma | upos | xpos | feats | tid_source | relation |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Yeah | Yeah | yeah | INTJ | UH | NA | 7 | discourse |
| 1 | 1 | 3 | um | um | um | INTJ | UH | NA | 1 | discourse |
| 1 | 1 | 5 | it | it | it | PRON | PRP | Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs | 7 | nsubj |
| 1 | 1 | 6 | was | was | be | AUX | VBD | Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin | 7 | cop |
| 1 | 1 | 7 | rough | rough | rough | ADJ | JJ | Degree=Pos | 0 | root |
| 1 | 2 | 1 | I | I | I | PRON | PRP | Case=Nom|Number=Sing|Person=1|PronType=Prs | 2 | nsubj |
| 1 | 2 | 2 | mean | mean | mean | VERB | VBP | Mood=Ind|Tense=Pres|VerbForm=Fin | 8 | discourse |
| 1 | 2 | 4 | unpredictable | unpredictable | unpredictable | ADJ | JJ | Degree=Pos | 8 | nsubj |
| 1 | 2 | 5 | is | is | be | AUX | VBZ | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin | 8 | cop |
| 1 | 2 | 6 | the | the | the | DET | DT | Definite=Def|PronType=Art | 8 | det |
tidy_interview_nlp |>
filter(upos == "NOUN") |>
group_by(lemma) |>
summarize(count = n()) |>
arrange(desc(count)) %>%
kable(format = "simple")
| lemma | count |
|---|---|
| support | 46 |
| thing | 42 |
| day | 33 |
| time | 33 |
| period | 25 |
| people | 24 |
| kind | 20 |
| symptom | 18 |
| help | 17 |
| way | 17 |
tidy_interview_nlp |>
filter(upos == "ADJ") |>
group_by(doc_id, lemma) |>
summarize(count = n()) |>
arrange(desc(count)) |>
slice(1:8) |>
summarize(lemma_paste = paste(lemma, collapse = "; ")) %>%
kable(format = "simple")
| doc_id | lemma_paste |
|---|---|
| 1 | more; different; much; small; acknowledgment; alone; best; big |
| 2 | appraisal; comfortable; different; dismissive; else; emotional; frustrating; huge |
| 3 | much; whole; actual; afraid; blue; close; closer; confus |
| 4 | biggest; daily; else; emotional; heavy; high; important; meaningful |
| 5 | huge; spiral; wrong; alone; anxious; bad; big; brave |
| 6 | better; consistent; else; high; important; intense; meaningful; powerful |
| 7 | little; alone; calm; certain; complicate; else; emotional; fancy |
| 8 | late; unpredictable; else; helpful; important; involve; long; low |
| 9 | new; small; emotional; calm; dependable; different; early; else |
| 10 | else; emotional; late; more; important; involved; larger; new |
tidy_interview_nlp |>
filter(doc_id %% 2 != 0) |>
group_by(doc_id) |>
summarize(
n_plural = sum(grepl("Number=Plur", feats)),
n_singular = sum(grepl("Number=Sing", feats)),
ratio_plural_total = n_plural / (n_plural + n_singular)
) |>
arrange(desc(ratio_plural_total)) %>%
kable(format = "simple")
| doc_id | n_plural | n_singular | ratio_plural_total |
|---|---|---|---|
| 27 | 44 | 98 | 0.3098592 |
| 15 | 45 | 116 | 0.2795031 |
| 5 | 45 | 121 | 0.2710843 |
| 7 | 37 | 106 | 0.2587413 |
| 3 | 50 | 150 | 0.2500000 |
| 9 | 44 | 137 | 0.2430939 |
| 11 | 36 | 115 | 0.2384106 |
| 29 | 34 | 119 | 0.2222222 |
| 13 | 28 | 105 | 0.2105263 |
| 1 | 42 | 164 | 0.2038835 |
Modify the functions here for proper context with surrounding words, significant meaning to the corpus, and correct use of linguistic markers.
Example words that may encompass important dimensions of the text, then highlight the intersection with language data.
dim_words = c('only', 'owner', 'affected', 'others', 'else', 'problem', 'responsible', 'deal', 'plan', 'manage', 'cope', 'rely', 'support', 'open', 'depend', 'self', 'care')
# placeholder for relation words
tidy_interview_nlp %>%
filter(lemma %in% dim_words) %>%
kable(format = "simple")
| doc_id | sid | tid | token | token_with_ws | lemma | upos | xpos | feats | tid_source | relation |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 11 | support | support | support | NOUN | NN | Number=Sing | 13 | nsubj |
| 2 | 2 | 11 | support | support | support | NOUN | NN | Number=Sing | 12 | nsubj |
| 2 | 20 | 9 | else | else | else | ADJ | JJ | Degree=Pos | 8 | amod |
| 2 | 20 | 12 | support | support | support | NOUN | NN | Number=Sing | 13 | compound |
| 3 | 38 | 5 | support | support | support | NOUN | NN | Number=Sing | 3 | nmod |
| 3 | 60 | 9 | support | support | support | NOUN | NN | Number=Sing | 6 | nmod |
| 4 | 1 | 15 | plan | plan | plan | NOUN | NN | Number=Sing | 16 | nsubj |
| 4 | 17 | 12 | else | else | else | ADJ | JJ | Degree=Pos | 11 | amod |
| 4 | 17 | 21 | support | support | support | NOUN | NN | Number=Sing | 22 | compound |
| 5 | 17 | 20 | managed | managed | manage | VERB | VBD | Mood=Ind|Tense=Past|VerbForm=Fin | 13 | advcl |
First, pre-process the data and organize proper variable names for the stm package. The tokenization performed here has a different working format compared to tidytext. Decide lower threshold, with the default being occurrence in > 1 document. “Just” should also be removed, as its high frequency detracts from topical domain words. May un-comment pdf lines to save outputs to pdf format.
set.seed(23456)
processed <- textProcessor(documents = interview_whole_text$text, metadata = interview_whole_df)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
out <- prepDocuments(documents = processed$documents,
vocab = processed$vocab,
meta = processed$meta,
lower.thresh = 1)
## Removing 664 of 1225 terms (664 of 3504 tokens) due to frequency
## Your corpus now has 30 documents, 561 terms and 2840 tokens.
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))
The plotRemoved function is helpful for exmaining different thresholds, but minimally useful with a small number of documents and less threshold decision-making.
Fit the starting model, adjusting thresholds as needed. Then try running stm with a range of K values to see what number of topics works best. Shortdoc run, uses first 200 chars and confirm success. Search a range of K topic numbers, adjust for corpus. Examine plots to try and find the highest exclusivity and highest (least negative) semantic coherence, and other optimal values.
# create the out object in the prior chunk
shortdoc <- substr(interview_whole_text$text, 1, 200)
invisible(modelPrevFit <- stm(documents = out$documents, vocab = out$vocab,
K = 10, prevalence =~ doc_id,
max.em.its = 75,
data = out$meta, init.type = "Spectral", verbose = FALSE))
invisible(capture.output(storage <- searchK(out$documents, out$vocab, K = 3:15,
prevalence =~ doc_id, data = meta)))
t <- storage$results[[1]]
t <- storage$results[[2]]
plot(storage)
plot_data <- storage$results %>%
mutate(
semcoh = as.numeric(unlist(semcoh)),
exclus = as.numeric(unlist(exclus)),
K = as.factor(unlist(K))
)
ggplot(plot_data, aes(x = semcoh, y = exclus, group = 1)) +
geom_line() +
geom_point() +
geom_text(aes(label = K), vjust = -1) +
labs(title = "Topic Model Diagnostics",
x = "Semantic Coherence",
y = "Exclusivity")
Iterate through Expectation-Maximization to select the best stm output. Adjust K for the optimal number of topics for the corpus: K = 5. Plot semantic coherence and exclusivity of the sets of topics, to quantify the results- then use the optimal model balancing semantic coherence and exclusivity.
modelSelect <- selectModel(out$documents, out$vocab, K = 5,
prevalence =~ doc_id, max.em.its = 75,
data = out$meta, runs = 20, seed = 23456)
# pdf("stmVignette-009.pdf")
plotModels(modelSelect)
# dev.off()
selectedmodel <- modelSelect$runout[[2]]
Another possible model uses init.type = ‘Spectral’ to use the built-in algorithm selecting the optimal K. This is computationally intense, not as recommended
Use metrics including FREX to characterize the models performance, words content. Find representative documents for the topic and quotes, words summarizing the frequent themes of the topic.See how segments of the words are more distinct in some quotes, vs. others.
labelTopics(selectedmodel, c(1, 5))
## Topic 1 Top Words:
## Highest Prob: didn’t, like, just, say, felt, help, said
## FREX: see, said, come, whole, forget, appoint, weren’t
## Lift: ’ll, “’ll, bring, follow, forget, gentl, grew
## Score: see, forget, whole, come, just, version, fight
## Topic 5 Top Words:
## Highest Prob: like, just, didn’t, need, ask, help, kept
## FREX: still, scare, “’m, good, perfect, kept, ask
## Lift: minim, scare, worse”, couch, “just, imagin, backup
## Score: just, rememb, yeah, kept, scare, perfect, minut
thoughts1 <- findThoughts(selectedmodel, texts = shortdoc,
n = 2, topics = 1)$docs[[1]]
thoughts5 <- findThoughts(selectedmodel, texts = shortdoc,
n = 2, topics = 5)$docs[[1]]
# pdf("stmVignette-015.pdf")
par(mfrow = c(2, 1), mar = c(.5, .5, 1, .5))
plotQuote(thoughts1, width = 40, maxwidth = 120, main = "Topic 1")
plotQuote(thoughts5, width = 40, maxwidth = 120, main = "Topic 5")
# dev.off()
Adjust the stm by including other variables: covariates from metadata. Flexible, adjust for corpus.Look at the proportion of topics and expected doc composition (?). Note that time variables require different handling, for dates.
meta$excerpt <- as.factor(meta$excerpt)
prep <- estimateEffect(1:5 ~ excerpt, selectedmodel,
meta = out$meta, uncertainty = "Global")
summary(prep, topics = 1)
##
## Call:
## estimateEffect(formula = 1:5 ~ excerpt, stmobj = selectedmodel,
## metadata = out$meta, uncertainty = "Global")
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.184166 0.121906 1.511 0.142
## excerpt -0.006592 0.012844 -0.513 0.612
# pdf("stmVignette-017.pdf")
plot(selectedmodel, type = "summary")
# dev.off()
# pdf("stmVignette-019.pdf")
# plot(prep, "day", method = "continuous", topics = 13,
# model = z, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
# monthseq <- seq(from = as.Date("2008-01-01"),
# to = as.Date("2008-12-01"), by = "month")
# monthnames <- months(monthseq)
# axis(1,
# at = as.numeric(monthseq) - min(as.numeric(monthseq)),
# labels = monthnames)
# dev.off()
Other plots that include stm word clouds, topic correlations, and convergence. Topics may go through other examinations and curation, especially human validation.
# pdf("stmVignette-025.pdf")
cloud(selectedmodel, topic = 5, scale = c(2, .25))
# dev.off()
mod.out.corr <- topicCorr(selectedmodel)
# pdf("stmVignette-027.pdf")
plot(mod.out.corr)
# dev.off()
# pdf("stmVignette-028.pdf")
plot(selectedmodel$convergence$bound, type = "l",
ylab = "Approximate Objective",
main = "Convergence")
# dev.off()
Structural topic models are built to incorporate covariate information; which would be the best for the corpus?
Again using tidytext, examine word pair co-occurrence in the documents. Consider what measures and network statistics are important to describe. Must have context from connected words, negations. High FREX words should be the focus of these graphs, instead of stop words.
tidy_interview_pairs <- tidy_interview_token %>%
pairwise_count(word, doc_id, sort = TRUE, upper = FALSE)
# network
set.seed(1234)
tidy_interview_pairs %>%
filter(n >= 10) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Correlation between pairs of words, running the same steps. Adjust for corpus, statistical tests.
# word correlation
tidy_interview_cors <- tidy_interview_token %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, doc_id, sort = TRUE, upper = FALSE)
set.seed(1234)
tidy_interview_cors %>%
filter(correlation > .6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "royalblue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Use the quanteda package to create a feature co-occurrence matrix, then run essentially the same procedure to measure word co-occurrence. Use the interview line as the unit of length of text. The same analysis can be done on a window of words, instead of the unit of words. Stop words must be removed for a more effective measurement.
word_tokens <- tokens(c(interview_text$text)) %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords("english"), padding = FALSE)
interview_fcm <- fcm(word_tokens, context = "document")
top_feats <- rowSums(interview_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_subset <- fcm_select(interview_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_subset) %>%
textplot_network(min_freq = 0.5)
interview_win_fcm <- fcm(word_tokens, context = "window", window = 5)
top_feats <- rowSums(interview_win_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_win_subset <- fcm_select(interview_win_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_win_subset) %>%
textplot_network(min_freq = 0.5)
Again, measure and track feature co-occurrence. This time, use each doc as the unit of text length. These networks can be improved in their quantitative output and informativeness for the corpus text. Visualization may also be improved for clarity. Again, stop words- remove.
word_tokens <- tokens(c(interview_whole_text$text)) %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords("english"), padding = FALSE)
interview_fcm <- fcm(word_tokens, context = "document")
top_feats <- rowSums(interview_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_subset <- fcm_select(interview_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_subset) %>%
textplot_network(min_freq = 0.5)