# more options below, as needed
options(knitr.kable.max_rows = 10)
Summary of the methods and objectives. The main missing component is the use of LIWC and other dictionary-based methods. The outline is as follows: load and prepare the data; examine individual words, word n-grams, and networks, incorporating natural discourse features; then explore structural topic models to collect themes. Co-occurrence and correlation networks are the final component of this notebook.
Load the text data, then clean it by removing blank rows, labeling lines by excerpt (document id), fixing name codes, and removing the social worker's talk. The output data frames parse the text at different levels: interview line, whole interview document, and interview sentence. Adjust for corpus.
source_text <- "C:/Users/newsomevw/OneDrive - National Institutes of Health/Desktop/TM scripts/interview_copilot_sample.txt"
original_interview <- read_lines(source_text, locale = locale(encoding = "UTF-8"))
original_interview <- original_interview[original_interview != ""]
interview_df <- tibble(text = original_interview) %>%
mutate(excerpt = str_extract(text, "(?<=^EXCERPT\\s)\\d+"), .before = 1) %>%
fill(excerpt, .direction = "down") %>%
filter(!str_detect(text, "^EXCERPT")) %>%
mutate(excerpt = as.integer(excerpt)) %>%
filter(!str_detect(text, "^SW:")) %>%
mutate(text = str_remove(text, "^CG:\\s")) %>%
mutate(text = gsub("([0-9]{2})-([0-9]{3})", "\\1\\2", text))
int_doc_df <- interview_df %>%
group_by(excerpt) %>%
summarise(doc_text = str_flatten(text, collapse = " "))
int_sentence_df <- interview_df %>%
unnest_tokens(sentence_text, text, token = "sentences", to_lower = FALSE)
Metadata can be loaded separately. Basic survey questions are included and may be explored. Adjust for corpus.
source_metadata <- "C:/Users/newsomevw/OneDrive - National Institutes of Health/Desktop/TM scripts/interview_copilot_csv.csv"
meta_df <- read_csv(source_metadata) %>%
mutate(excerpt = row_number(), .before = 1)
meta_df %>%
mutate(across(where(is.character), as.factor)) %>%
summary() %>%
kable(format = "simple")
| excerpt | Gender | Age | Race | MaritalStatus | PsychDistress | RelationalSatisfaction |
|---|---|---|---|---|---|---|
| Min. : 1.0 | F:9 | Min. :33.0 | B:7 | M :6 | N:8 | N:8 |
| 1st Qu.: 4.5 | M:6 | 1st Qu.:38.5 | W:8 | NM:9 | Y:7 | Y:7 |
| Median : 8.0 | NA | Median :42.0 | NA | NA | NA | NA |
| Mean : 8.0 | NA | Mean :42.8 | NA | NA | NA | NA |
| 3rd Qu.:11.5 | NA | 3rd Qu.:46.5 | NA | NA | NA | NA |
| Max. :15.0 | NA | Max. :53.0 | NA | NA | NA | NA |
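If the survey variables are needed later as document-level covariates (for example in later topic-model prevalence formulas), they can be attached to the document table. A minimal sketch, assuming the excerpt ids in meta_df line up with the interview excerpts; int_doc_meta_df is an illustrative name.
# attach survey metadata to the document-level text by excerpt id
int_doc_meta_df <- int_doc_df %>%
  left_join(meta_df, by = "excerpt")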
If the text needs to be converted to a specific tabular format for LIWC or other dictionary-based methods, convert it at or before this point.
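A minimal sketch of such an export, one row per document with an id column and a text column; "liwc_input.csv" is a hypothetical file name, adjust the path for the corpus.
# write one row per excerpt for LIWC or other external dictionary tools
int_doc_df %>%
  select(excerpt, doc_text) %>%
  write_csv("liwc_input.csv")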
Use the unnest_tokens function from tidytext to create a token table. Load the stop words, but do not remove them yet. Remove noisy words if any are discovered later (a sketch of such a filter follows the tokenization code below). Be careful with stop words that express negation (not, no, never), since they carry meaning.
data(stop_words)
tidy_int_doc <- int_doc_df %>%
unnest_tokens(word, doc_text)
tidy_interview <- interview_df %>%
mutate(line = row_number(), .after = 1) %>%
unnest_tokens(word, text)
tidy_int_sentence <- int_sentence_df %>%
mutate(sentence = row_number(), .after = 1) %>%
unnest_tokens(word, sentence_text)
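If noisy words turn up later, they can be dropped with an anti_join against a small custom list. A minimal sketch, where noise_words is a hypothetical, corpus-specific list and tidy_int_doc_clean is an illustrative name.
# hypothetical corpus-specific noise words; extend as they are discovered
noise_words <- tibble(word = c("um", "uh", "yeah"))
tidy_int_doc_clean <- tidy_int_doc %>%
  anti_join(noise_words, by = "word")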
Calculate term frequency-inverse document frequency (TF-IDF) for each word, relative to its document, using the bind_tf_idf function from tidytext. Stop words were removed to avoid inflated counts, and name codes were removed because they created the most distinction between documents without any linguistic difference.
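For reference, bind_tf_idf uses the standard definition with a natural-log inverse document frequency:

$$\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{\lvert\{d' : t \in d'\}\rvert}$$

where N is the number of documents (excerpts) and n_{t,d} is the count of word t in document d.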
int_doc_words <- tidy_int_doc %>%
anti_join(stop_words) %>%
filter(is.na(as.numeric(as.character(word)))) %>%
count(excerpt, word, sort = TRUE)
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `is.na(as.numeric(as.character(word)))`.
## Caused by warning:
## ! NAs introduced by coercion
int_doc_words %>%
group_by(excerpt) %>%
slice_max(n, n = 5)
## # A tibble: 122 × 3
## # Groups: excerpt [15]
## excerpt word n
## <int> <chr> <int>
## 1 1 feel 6
## 2 1 i’d 6
## 3 1 yeah 4
## 4 1 didn’t 3
## 5 1 honestly 3
## 6 1 hours 3
## 7 1 i’m 3
## 8 1 listen 3
## 9 1 they’d 3
## 10 1 thinking 3
## # ℹ 112 more rows
total_doc_words <- int_doc_words %>%
group_by(excerpt) %>%
summarize(total = sum(n))
int_doc_words <- left_join(int_doc_words, total_doc_words)
int_doc_tf_idf <- int_doc_words %>%
bind_tf_idf(word, excerpt, n)
int_doc_tf_idf %>%
arrange(desc(tf_idf)) %>%
kable(format = "simple")
| excerpt | word | n | total | tf | idf | tf_idf |
|---|---|---|---|---|---|---|
| 8 | boundaries | 3 | 136 | 0.0220588 | 2.708050 | 0.0597364 |
| 2 | intentions | 3 | 158 | 0.0189873 | 2.708050 | 0.0514187 |
| 9 | speak | 2 | 132 | 0.0151515 | 2.708050 | 0.0410311 |
| 3 | chaos | 2 | 136 | 0.0147059 | 2.708050 | 0.0398243 |
| 3 | dropping | 2 | 136 | 0.0147059 | 2.708050 | 0.0398243 |
| 3 | spent | 2 | 136 | 0.0147059 | 2.708050 | 0.0398243 |
| 6 | crisis | 3 | 123 | 0.0243902 | 1.609438 | 0.0392546 |
| 5 | job | 2 | 141 | 0.0141844 | 2.708050 | 0.0384121 |
| 8 | respected | 3 | 136 | 0.0220588 | 1.609438 | 0.0355023 |
| 2 | attention | 2 | 158 | 0.0126582 | 2.708050 | 0.0342791 |
int_doc_tf_idf %>%
group_by(excerpt) %>%
slice_max(tf_idf, n = 5)
## # A tibble: 291 × 7
## # Groups: excerpt [15]
## excerpt word n total tf idf tf_idf
## <int> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 1 hours 3 187 0.0160 2.01 0.0323
## 2 1 listen 3 187 0.0160 2.01 0.0323
## 3 1 team 2 187 0.0107 2.71 0.0290
## 4 1 handle 2 187 0.0107 1.61 0.0172
## 5 1 thinking 3 187 0.0160 0.916 0.0147
## 6 2 intentions 3 158 0.0190 2.71 0.0514
## 7 2 attention 2 158 0.0127 2.71 0.0343
## 8 2 fears 2 158 0.0127 2.71 0.0343
## 9 2 quotes 2 158 0.0127 2.71 0.0343
## 10 2 scream 2 158 0.0127 2.01 0.0255
## # ℹ 281 more rows
The TF-IDF of each word across all documents can also provide useful information; widen the table so each excerpt becomes a column.
# for word relationships, widen
# pivot wider names from excerpt, values from tfidf, values fill 0
int_wide_tf_idf <- int_doc_tf_idf %>%
select(c(excerpt, word, tf_idf)) %>%
mutate(excerpt = paste("Excerpt", excerpt)) %>%
pivot_wider(
names_from = excerpt,
values_from = tf_idf,
values_fill = 0
) %>%
filter(!if_all(-1, ~ .x == 0))
int_wide_tf_idf
## # A tibble: 941 × 16
## word `Excerpt 3` `Excerpt 10` `Excerpt 7` `Excerpt 8` `Excerpt 9`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 didn’t 0.00406 0.00319 0.000807 0.00203 0.00314
## 2 i’d 0.000507 0.000798 0.00202 0.000507 0.00157
## 3 i’m 0.00924 0 0 0.0139 0.00476
## 4 yeah 0.00101 0.000399 0.000807 0.00101 0.00105
## 5 day 0.00462 0.0109 0.00735 0.00462 0
## 6 time 0.00376 0 0.00299 0.00751 0
## 7 don’t 0 0.00727 0.00735 0 0
## 8 it’s 0.00228 0.00179 0.00726 0.00684 0.00705
## 9 person 0 0 0 0 0
## 10 appointments 0 0.00530 0.00536 0 0.0278
## # ℹ 931 more rows
## # ℹ 10 more variables: `Excerpt 14` <dbl>, `Excerpt 1` <dbl>,
## # `Excerpt 4` <dbl>, `Excerpt 13` <dbl>, `Excerpt 2` <dbl>,
## # `Excerpt 6` <dbl>, `Excerpt 11` <dbl>, `Excerpt 12` <dbl>,
## # `Excerpt 5` <dbl>, `Excerpt 15` <dbl>
Basic sentiments are described for the words in the corpus. Negations are handled later, given that some of these words are negated in context.
int_doc_words <- tidy_int_doc %>%
filter(is.na(as.numeric(as.character(word)))) %>%
count(excerpt, word, sort = TRUE)
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `is.na(as.numeric(as.character(word)))`.
## Caused by warning:
## ! NAs introduced by coercion
int_sentiments <- int_doc_words %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(word, sentiment, sort = TRUE)
int_sentiments %>%
group_by(sentiment) %>%
slice_max(n, n = 15) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Create a table of word bigrams to investigate their TF-IDF. Remove stop words from the bigrams before calculating TF-IDF; this requires separating, filtering, and re-uniting the bigrams. Adjust n to observe different n-grams, as shown in the examples for n = 2 and n = 3.
tidy_int_bigrams <- int_doc_df %>%
unnest_tokens(bigram, doc_text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
tidy_int_bigrams %>%
count(bigram, sort = TRUE) %>%
kable(format = "simple")
| bigram | n |
|---|---|
| it was | 45 |
| i was | 43 |
| i didn’t | 34 |
| made me | 32 |
| it made | 30 |
| like i | 29 |
| me feel | 29 |
| i felt | 28 |
| and then | 26 |
| but it | 23 |
bigrams_separated <- tidy_int_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
int_bigram_tf_idf <- bigrams_united %>%
count(excerpt, bigram) %>%
bind_tf_idf(bigram, excerpt, n) %>% # term first, then document, matching the word-level call above
arrange(desc(tf_idf))
int_bigram_tf_idf %>%
kable(format = "simple")
| excerpt | bigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| 11 | 56003 stepped | 1 | 1 | 3.270836 | 3.270836 |
| 11 | 56005 helped | 1 | 1 | 3.270836 | 3.270836 |
| 11 | bad episode | 1 | 1 | 3.270836 | 3.270836 |
| 11 | body feeling | 1 | 1 | 3.270836 | 3.270836 |
| 11 | caregiver machine | 1 | 1 | 3.270836 | 3.270836 |
| 11 | constant questions | 1 | 1 | 3.270836 | 3.270836 |
| 11 | didn’t interfere | 1 | 1 | 3.270836 | 3.270836 |
| 11 | documenting symptoms | 1 | 1 | 3.270836 | 3.270836 |
| 11 | drive safely | 1 | 1 | 3.270836 | 3.270836 |
| 11 | feel worse | 1 | 1 | 3.270836 | 3.270836 |
tidy_interview_trigrams <- int_doc_df %>%
unnest_tokens(trigram, doc_text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram))
tidy_interview_trigrams %>%
count(trigram, sort = TRUE) %>%
kable(format = "simple")
| trigram | n |
|---|---|
| it made me | 25 |
| made me feel | 23 |
| me feel like | 16 |
| like i was | 15 |
| feel like i | 12 |
| i didn’t want | 11 |
| just that i | 11 |
| say things like | 11 |
| that i learned | 11 |
| it felt like | 8 |
trigrams_separated <- tidy_interview_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word)
trigrams_counts <- trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
trigrams_united <- trigrams_filtered %>%
unite(trigram, word1, word2, word3, sep = " ")
interview_trigram_tf_idf <- trigrams_united %>%
count(excerpt, trigram) %>%
bind_tf_idf(trigram, excerpt, n) %>% # term first, then document
arrange(desc(tf_idf))
interview_trigram_tf_idf %>%
kable(format = "simple")
| excerpt | trigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| 8 | 94002 whoa surprisingly | 1 | 1 | 4.436751 | 4.436751 |
| 8 | feel respected completely | 1 | 1 | 4.436751 | 4.436751 |
| 11 | documenting symptoms rest | 1 | 1 | 4.031286 | 4.031286 |
| 11 | i’d start worrying | 1 | 1 | 4.031286 | 4.031286 |
| 11 | symptoms rest wasn’t | 1 | 1 | 4.031286 | 4.031286 |
| 6 | 78003 i’m grateful | 1 | 1 | 3.338139 | 3.338139 |
| 6 | consistent yeah 78005 | 1 | 1 | 3.338139 | 3.338139 |
| 6 | there’s 78003 who’s | 1 | 1 | 3.338139 | 3.338139 |
| 6 | they’d start spiraling | 1 | 1 | 3.338139 | 3.338139 |
| 6 | wasn’t consistent yeah | 1 | 1 | 3.338139 | 3.338139 |
Additional analysis of the impact of "not" and broader negation words (not, no, never, without). A different sentiment lexicon (AFINN) is used here, giving a numerical sentiment score. Also note that the plotted colors correspond to the original sentiment; with negation, the effective sentiment is the opposite.
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE) %>%
kable(format = "simple")
| word1 | word2 | n |
|---|---|---|
| not | just | 4 |
| not | a | 3 |
| not | to | 3 |
| not | because | 2 |
| not | really | 2 |
| not | actually | 1 |
| not | around | 1 |
| not | at | 1 |
| not | before | 1 |
| not | consistently | 1 |
tidy_int_not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
tidy_int_not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")
negation_words <- c("not", "no", "never", "without")
bigrams_separated %>%
filter(word1 %in% negation_words) %>%
count(word1, word2, sort = TRUE) %>%
kable(format = "simple")
| word1 | word2 | n |
|---|---|---|
| no | one | 5 |
| without | me | 5 |
| not | just | 4 |
| not | a | 3 |
| not | to | 3 |
| no | fuss | 2 |
| no | guilt | 2 |
| not | because | 2 |
| not | really | 2 |
| never | asked | 1 |
tidy_int_negation_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
tidy_int_negation_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by a negation word")
Plot a network of bigrams, weighted by occurrence frequency. Then show the difference between a directed and an undirected network for the corpus. Adjust for corpus. This graph would be improved by pre-processing that highlights bigrams containing dictionary-specific words, relational words, or the highest-frequency words (a sketch follows the network plots below).
bigram_graph <- bigrams_counts %>%
filter(n > 1) %>%
graph_from_data_frame()
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
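A minimal sketch of that pre-processing, keeping only bigrams that touch words of interest; words_of_interest is a hypothetical dictionary or relational list, and bigram_graph_focus is an illustrative name.
# keep only bigrams where either word is in a hypothetical list of interest
words_of_interest <- c("support", "help", "care", "alone")
bigram_graph_focus <- bigrams_counts %>%
  filter(word1 %in% words_of_interest | word2 %in% words_of_interest) %>%
  graph_from_data_frame()
The focused graph can then be plotted with the same ggraph calls as above.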
Add network data in metadata format if it exists. If not, generate it from token co-occurrence within sentences. Start with a basic definition of interaction, then refine it; begin with annotation to capture grammatical relations.
cnlp_init_udpipe() # run once
interview_anno <- cnlp_annotate(
input = int_doc_df,
text_name = "doc_text",
doc_name = "excerpt"
)
## Processed document 10 of 15
interview_anno
## $token
## # A tibble: 9,028 × 11
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## * <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1 Yeah "Yeah" yeah INTJ UH <NA> 7
## 2 1 1 2 … "… " … PUNCT . <NA> 1
## 3 1 1 3 um "um" um INTJ UH <NA> 1
## 4 1 1 4 , ", " , PUNCT , <NA> 7
## 5 1 1 5 it "it " it PRON PRP Case=Nom… 7
## 6 1 1 6 was "was " be AUX VBD Mood=Ind… 7
## 7 1 1 7 rough "rough" rough ADJ JJ Degree=P… 0
## 8 1 1 8 . ". " . PUNCT . <NA> 7
## 9 1 2 1 I "I " I PRON PRP Case=Nom… 2
## 10 1 2 2 mean "mean" mean VERB VBP Mood=Ind… 8
## # ℹ 9,018 more rows
## # ℹ 1 more variable: relation <chr>
##
## $document
## # A tibble: 15 × 1
## doc_id
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## 11 11
## 12 12
## 13 13
## 14 14
## 15 15
##
## attr(,"class")
## [1] "cnlp_annotation" "list"
Node filter: a name code OR a singular, non-neuter, non-possessive pronoun OR a plural, non-possessive pronoun. This should accept masculine or feminine singular pronouns, though none are present in this corpus. Adjust for corpus. Adjust the filter feeding node_sentences to identify that I, me, and myself are the same person.
# tokens that represent nodes
interview_anno_nodes <- interview_anno$token %>%
filter(
str_detect(token, "^\\d{5}$") |
(str_detect(feats, "Number=Sing") & str_detect(feats, "Person") &
!str_detect(feats, "Gender=Neut") & !str_detect(feats, "Poss=Yes")) |
(str_detect(feats, "Number=Plur") & str_detect(feats, "Person")
& !str_detect(feats, "Poss=Yes"))
) %>%
filter(upos != "AUX" & upos != "VERB")
# sentences id containing these tokens; 2 or more entities
node_sentences <- interview_anno_nodes %>%
group_by(doc_id, sid) %>%
filter(sum(str_detect(feats, "Number=Sing") & str_detect(feats, "Person") &
!str_detect(feats, "Gender=Neut") & !str_detect(feats, "Poss=Yes")) <= 1
) %>%
ungroup() %>%
count(doc_id, sid) %>%
filter(n > 1)
# get sentences text (tokens)
interview_anno_nodes <- interview_anno_nodes %>%
inner_join(node_sentences, by = c("doc_id", "sid"))
interview_anno_nodes_sentences <- interview_anno$token %>%
inner_join(node_sentences, by = c("doc_id", "sid"))
interview_anno_nodes %>%
distinct(upos)
## # A tibble: 2 × 1
## upos
## <chr>
## 1 PRON
## 2 NUM
interview_anno_nodes %>%
distinct(token)
## # A tibble: 53 × 1
## token
## <chr>
## 1 they
## 2 I
## 3 21001
## 4 me
## 5 we
## 6 21002
## 7 them
## 8 21004
## 9 They
## 10 34001
## # ℹ 43 more rows
interview_anno$token
## # A tibble: 9,028 × 11
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## * <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1 Yeah "Yeah" yeah INTJ UH <NA> 7
## 2 1 1 2 … "… " … PUNCT . <NA> 1
## 3 1 1 3 um "um" um INTJ UH <NA> 1
## 4 1 1 4 , ", " , PUNCT , <NA> 7
## 5 1 1 5 it "it " it PRON PRP Case=Nom… 7
## 6 1 1 6 was "was " be AUX VBD Mood=Ind… 7
## 7 1 1 7 rough "rough" rough ADJ JJ Degree=P… 0
## 8 1 1 8 . ". " . PUNCT . <NA> 7
## 9 1 2 1 I "I " I PRON PRP Case=Nom… 2
## 10 1 2 2 mean "mean" mean VERB VBP Mood=Ind… 8
## # ℹ 9,018 more rows
## # ℹ 1 more variable: relation <chr>
interview_anno_nodes
## # A tibble: 340 × 12
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 3 4 they "they " they PRON PRP Case=Nom… 6
## 2 1 3 12 I "I" I PRON PRP Case=Nom… 17
## 3 1 4 6 21001 "21001 " 21001 NUM CD NumType=… 7
## 4 1 4 9 me "me " I PRON PRP Case=Acc… 8
## 5 1 14 1 I "I " I PRON PRP Case=Nom… 2
## 6 1 14 7 we "we" we PRON PRP Case=Nom… 8
## 7 1 14 17 21002 "21002 " 21002 NUM CD NumType=… 19
## 8 1 16 2 21001 "21001 " 21001 NUM CD NumType=… 3
## 9 1 16 6 I "I" I PRON PRP Case=Nom… 7
## 10 1 23 1 I "I " I PRON PRP Case=Nom… 2
## # ℹ 330 more rows
## # ℹ 2 more variables: relation <chr>, n <int>
node_sentences
## # A tibble: 151 × 3
## doc_id sid n
## <int> <int> <int>
## 1 1 3 2
## 2 1 4 2
## 3 1 14 3
## 4 1 16 2
## 5 1 23 2
## 6 1 25 2
## 7 1 26 2
## 8 1 33 2
## 9 1 35 3
## 10 1 57 2
## # ℹ 141 more rows
interview_anno_nodes_sentences
## # A tibble: 2,584 × 12
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 3 1 One "One " one NUM CD NumTyp… 2
## 2 1 3 2 day "day" day NOUN NN Number… 6
## 3 1 3 3 , ", " , PUNCT , <NA> 2
## 4 1 3 4 they "they " they PRON PRP Case=N… 6
## 5 1 3 5 were "were " be AUX VBD Mood=I… 6
## 6 1 3 6 stable "stable" stable ADJ JJ Degree… 0
## 7 1 3 7 , ", " , PUNCT , <NA> 17
## 8 1 3 8 and "and " and CCONJ CC <NA> 17
## 9 1 3 9 the "the " the DET DT Defini… 11
## 10 1 3 10 next "next " next ADJ JJ Degree… 11
## # ℹ 2,574 more rows
## # ℹ 2 more variables: relation <chr>, n <int>
Now, to create the network, make a new table containing the interactions between tokens (individuals).
# handling to match I, me, my all to self; may be re-written for clarity
interview_network <- interview_anno_nodes %>%
select(doc_id, sid, token) %>%
mutate(
token = case_match(
token,
"me" ~ "self",
"I" ~ "self",
.default = token
)
) %>%
group_by(doc_id, sid) %>%
mutate(
ref_id = first(na.omit(token[str_detect(token, "^\\d{5}$")])),
token = if_else(token %in% c("they", "them", "we"), ref_id, token)
) %>%
ungroup() %>%
filter(!is.na(token)) %>%
distinct(doc_id, sid, token) %>% # optional: one token per sentence
group_by(doc_id, sid) %>%
mutate(context = 1L) %>% # one co-occurrence context per sentence
pairwise_count(token, context, upper = FALSE, sort = TRUE) %>%
ungroup() %>%
transmute(doc_id, sid, name1 = item1, name2 = item2)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `token = case_match(token, "me" ~ "self", "I" ~ "self", .default
## = token)`.
## Caused by warning:
## ! `case_match()` was deprecated in dplyr 1.2.0.
## ℹ Please use `recode_values()` instead.
# interview_network <- as_tbl_graph(edges, directed = FALSE)
# sentiment- assign values to edges
edge_scores <- interview_anno_nodes_sentences %>%
filter(upos %in% c("VERB", "ADV", "ADJ")) %>%
inner_join(get_sentiments("afinn"), by = c("token" = "word")) %>%
group_by(doc_id, sid) %>%
summarise(edge_score = mean(value, na.rm = TRUE), .groups = "drop")
interview_network <- interview_network %>%
left_join(edge_scores, by = c("doc_id", "sid"))
Single-excerpt version of the loop below, to check one network before iterating over all excerpts.
n = 3
int_net_graph <- interview_network %>%
filter(doc_id == n) %>%
group_by(name1, name2) %>%
summarise(
n = n(),
edge_score = first(edge_score),
.groups = "drop"
) %>%
graph_from_data_frame()
summary(E(int_net_graph)$edge_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.00 -2.75 -2.50 -2.50 -2.25 -2.00 2
for (n in 1:15) {
int_net_graph <- interview_network %>%
filter(doc_id == n) %>%
group_by(name1, name2) %>%
summarise(
n = n(),
edge_score = first(edge_score),
.groups = "drop"
) %>%
graph_from_data_frame()
set.seed(2026)
p <- ggraph(int_net_graph, layout = "fr") +
geom_edge_link(aes(edge_colour = edge_score)) +
scale_edge_color_gradient2(
low = "red", mid = "turquoise", high = "blue",
midpoint = 0
) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
print(p)
}
# how do words associate with actors?
interview_anno_nodes_sentences |>
filter(upos == "VERB") |>
group_by(doc_id, lemma) |>
summarize(count = n()) |>
arrange(desc(count)) |>
slice(1:8) |>
summarize(lemma_paste = paste(lemma, collapse = "; ")) %>%
kable(format = "simple")
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by doc_id and lemma.
## ℹ Output is grouped by doc_id.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(doc_id, lemma))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
| doc_id | lemma_paste |
|---|---|
| 1 | ’d; keep; listen; try; ask; breathe; call; calme |
| 2 | feel; ask; follow; give; insist; look; mean; mess |
| 3 | know; ask; feel; say; spend; come; find; get |
| 4 | do; say; ask; send; ’d; call; feel; hold |
| 5 | feel; make; ’d; help; say; try; deal; deserve |
| 6 | ’d; ask; feel; be; call; deal; do; drive |
| 7 | feel; ’d; get; offer; book; bring; car; care |
| 8 | ’d; say; get; leave; need; push; reorganize; try |
| 9 | ’d; ask; come; help; need; say; take; be |
| 10 | ’d; say; get; help; mean; adjust; argue; be |
# how are the actors associated?
interview_anno_nodes_sentences_sentiment <- interview_anno_nodes_sentences %>%
filter(relation == "root") %>%
inner_join(get_sentiments("bing"), by = c("lemma" = "word"))
interview_anno_nodes_sentences_sentiment
## # A tibble: 7 × 13
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 3 6 stable "stable" stable ADJ JJ Degr… 0
## 2 6 2 14 calm "calm " calm ADJ JJ Degr… 0
## 3 9 20 5 defensive "defensive" defen… ADJ JJ Degr… 0
## 4 11 18 4 interfere "interfere" inter… VERB VB Verb… 0
## 5 11 51 2 trusted "trusted " trust VERB VBD Mood… 0
## 6 14 26 4 joke "joke " joke VERB VB Verb… 0
## 7 15 37 5 support "support" suppo… NOUN NN Numb… 0
## # ℹ 3 more variables: relation <chr>, n <int>, sentiment <chr>
# not enough sample size, here
# token and token source
# interview_anno_nodes_sentences
# join doc_id = doc_id, sid = sid, tid_source = tid
#
# doc_distance(angular)
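A minimal sketch of the head-token join noted in the placeholder above (doc_id = doc_id, sid = sid, tid_source = tid), attaching each token's syntactic head for later relation work; column names follow the cnlp output shown above, and tokens_with_heads is an illustrative name.
# join each token to its head token via tid_source; root tokens (tid_source "0") get NA heads
tokens_with_heads <- interview_anno_nodes_sentences %>%
  left_join(
    interview_anno$token %>%
      select(doc_id, sid, tid, head_token = token, head_upos = upos),
    by = c("doc_id", "sid", "tid_source" = "tid")
  )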
Use the cleanNLP package for word features, lemmatization, and related word dictionaries. For NLP annotation, first load the right dictionary/model; the word features are then computed automatically. Highlight the most used nouns across documents and the most used adjectives, also grouped by document. Check how much we-language is used compared to total language.
interview_anno$token <- interview_anno$token %>%
filter(upos != "PUNCT")
# select(interview_anno, token, xpos, feats, tid_source, relation)
interview_anno %>%
kable(format = "simple")
| doc_id | sid | tid | token | token_with_ws | lemma | upos | xpos | feats | tid_source | relation |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Yeah | Yeah | yeah | INTJ | UH | NA | 7 | discourse |
| 1 | 1 | 3 | um | um | um | INTJ | UH | NA | 1 | discourse |
| 1 | 1 | 5 | it | it | it | PRON | PRP | Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs | 7 | nsubj |
| 1 | 1 | 6 | was | was | be | AUX | VBD | Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin | 7 | cop |
| 1 | 1 | 7 | rough | rough | rough | ADJ | JJ | Degree=Pos | 0 | root |
| 1 | 2 | 1 | I | I | I | PRON | PRP | Case=Nom|Number=Sing|Person=1|PronType=Prs | 2 | nsubj |
| 1 | 2 | 2 | mean | mean | mean | VERB | VBP | Mood=Ind|Tense=Pres|VerbForm=Fin | 8 | discourse |
| 1 | 2 | 4 | unpredictable | unpredictable | unpredictable | ADJ | JJ | Degree=Pos | 8 | nsubj |
| 1 | 2 | 5 | is | is | be | AUX | VBZ | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin | 8 | cop |
| 1 | 2 | 6 | the | the | the | DET | DT | Definite=Def|PronType=Art | 8 | det |
| doc_id |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
interview_anno$token |>
filter(upos == "NOUN") |>
group_by(lemma) |>
summarize(count = n()) |>
arrange(desc(count)) %>%
kable(format = "simple")
| lemma | count |
|---|---|
| thing | 38 |
| day | 26 |
| time | 20 |
| call | 16 |
| people | 16 |
| support | 16 |
| way | 12 |
| one | 11 |
| person | 11 |
| symptom | 11 |
interview_anno$token |>
filter(upos == "ADJ") |>
group_by(doc_id, lemma) |>
summarize(count = n()) |>
arrange(desc(count)) |>
slice(1:8) |>
summarize(lemma_paste = paste(lemma, collapse = "; ")) %>%
kable(format = "simple")
| doc_id | lemma_paste |
|---|---|
| 1 | more; different; much; small; acknowledgment; alone; best; big |
| 2 | big; more; right; available; bright; emotional; generous; good |
| 3 | scared; worse; better; big; fancy; first; full; gourmet |
| 4 | much; perfect; better; calm; consistent; emotional; enough; fake |
| 5 | full; whole; alone; bright; close; closer; denial; enough |
| 6 | calm; actual; bad; close; consistent; else; grateful; great |
| 7 | much; real; alone; bad; closest; complicated; confused; connect |
| 8 | more; closer; defensive; dietary; different; difficult; distant; do |
| 9 | much; bad; beautiful; best; better; big; constant; defensive |
| 10 | much; whole; actual; afraid; blue; close; closer; confus |
interview_anno$token |>
filter(doc_id %% 2 != 0) |>
group_by(doc_id) |>
summarize(
n_plural = sum(grepl("Number=Plur", feats)),
n_singular = sum(grepl("Number=Sing", feats)),
ratio_plural_total = n_plural / (n_plural + n_singular)
) |>
arrange(desc(ratio_plural_total)) %>%
kable(format = "simple")
| doc_id | n_plural | n_singular | ratio_plural_total |
|---|---|---|---|
| 11 | 45 | 121 | 0.2710843 |
| 13 | 44 | 137 | 0.2430939 |
| 9 | 34 | 119 | 0.2222222 |
| 15 | 28 | 105 | 0.2105263 |
| 1 | 42 | 164 | 0.2038835 |
| 3 | 29 | 121 | 0.1933333 |
| 5 | 27 | 122 | 0.1812081 |
| 7 | 33 | 165 | 0.1666667 |
Modify the functions here for proper context from surrounding words, meaning that is significant to the corpus, and correct use of linguistic markers. Below are example words that may capture important dimensions of the text; the next step highlights their intersection with the language data.
dim_words = c('only', 'owner', 'affected', 'others', 'else', 'problem', 'responsible', 'deal', 'plan', 'manage', 'cope', 'rely', 'support', 'open', 'depend', 'self', 'care')
# placeholder for relation words
interview_anno$token %>%
filter(lemma %in% dim_words) %>%
kable(format = "simple")
| doc_id | sid | tid | token | token_with_ws | lemma | upos | xpos | feats | tid_source | relation |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 11 | support | support | support | NOUN | NN | Number=Sing | 13 | nsubj |
| 2 | 69 | 8 | rely | rely | rely | VERB | VB | VerbForm=Inf | 4 | ccomp |
| 3 | 4 | 5 | only | only | only | ADJ | JJ | Degree=Pos | 6 | amod |
| 3 | 37 | 10 | support | support | support | NOUN | NN | Number=Sing | 11 | nsubj |
| 3 | 43 | 12 | deal | deal | deal | NOUN | NN | Number=Sing | 8 | obj |
| 4 | 8 | 12 | manage | manage | manage | VERB | VB | VerbForm=Inf | 8 | advcl |
| 4 | 30 | 1 | supported | supported | support | VERB | VBD | Mood=Ind|Tense=Past|VerbForm=Fin | 0 | root |
| 4 | 36 | 4 | dealing | dealing | deal | VERB | VBG | VerbForm=Ger | 7 | advcl |
| 4 | 37 | 8 | rely | rely | rely | VERB | VB | VerbForm=Inf | 6 | advcl |
| 4 | 39 | 6 | depend | depend | depend | VERB | VB | VerbForm=Inf | 4 | xcomp |
First, pre-process the data and organize proper variable names for the stm package. The tokenization performed here has a different working format than tidytext. Decide on the lower threshold, with the default keeping terms that occur in more than one document. "Just" should also be removed, as its high frequency detracts from topical domain words (a sketch using customstopwords follows the plotRemoved note below). Un-comment the pdf lines to save outputs to pdf format.
set.seed(23456)
processed <- textProcessor(documents = interview_df$text, metadata = interview_df)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
out <- prepDocuments(documents = processed$documents,
vocab = processed$vocab,
meta = processed$meta,
lower.thresh = 1)
## Removing 578 of 1040 terms (578 of 3341 tokens) due to frequency
## Your corpus now has 194 documents, 462 terms and 2763 tokens.
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))
The plotRemoved function is helpful for examining different thresholds, but it is of limited use with a small number of documents and little threshold decision-making to do.
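A minimal sketch of removing "just" (and similar fillers) during preprocessing, using textProcessor's customstopwords argument; processed_custom is an illustrative name, and prepDocuments would be re-run on it as above.
# re-run preprocessing with extra custom stop words
processed_custom <- textProcessor(documents = interview_df$text,
                                  metadata = interview_df,
                                  customstopwords = c("just"))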
Fit the starting model, adjusting thresholds as needed. Then try running stm over a range of K values to see what number of topics works best. shortdoc keeps the first 200 characters of each document and is used to confirm the run succeeds. Search over a range of K topic numbers; adjust for corpus. Examine the plots to find the highest exclusivity and the highest (least negative) semantic coherence, along with other optimal values.
# create the out object in the prior chunk
shortdoc <- substr(interview_df$text, 1, 200)
invisible(modelPrevFit <- stm(documents = out$documents, vocab = out$vocab,
K = 10, prevalence =~ excerpt,
max.em.its = 75,
data = out$meta, init.type = "Spectral", verbose = FALSE))
invisible(capture.output(storage <- searchK(out$documents, out$vocab, K = 3:15,
prevalence =~ excerpt, data = meta)))
t <- storage$results[[1]]
t <- storage$results[[2]]
plot(storage)
plot_data <- storage$results %>%
mutate(
semcoh = as.numeric(unlist(semcoh)),
exclus = as.numeric(unlist(exclus)),
K = as.factor(unlist(K))
)
ggplot(plot_data, aes(x = semcoh, y = exclus, group = 1)) +
geom_line() +
geom_point() +
geom_text(aes(label = K), vjust = -1) +
labs(title = "Topic Model Diagnostics",
x = "Semantic Coherence",
y = "Exclusivity")
Iterate through Expectation-Maximization runs to select the best stm output. Adjust K to the optimal number of topics for the corpus: K = 5 here. Plot the semantic coherence and exclusivity of the candidate sets of topics to quantify the results, then use the optimal model balancing semantic coherence and exclusivity.
modelSelect <- selectModel(out$documents, out$vocab, K = 5,
prevalence =~ excerpt, max.em.its = 75,
data = out$meta, runs = 20, seed = 23456)
# pdf("stmVignette-009.pdf")
plotModels(modelSelect)
# dev.off()
selectedmodel <- modelSelect$runout[[2]]
Another possible model sets K = 0 with init.type = "Spectral", which uses a built-in algorithm to select the optimal number of topics. This is computationally intensive and not recommended here.
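A minimal sketch of that option; modelSpectralK is an illustrative name.
# K = 0 with Spectral initialization lets stm choose the number of topics; can be slow
modelSpectralK <- stm(documents = out$documents, vocab = out$vocab,
                      K = 0, prevalence =~ excerpt,
                      max.em.its = 75,
                      data = out$meta, init.type = "Spectral", verbose = FALSE)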
Use metrics including FREX to characterize each model's performance and word content. Find representative documents and quotes for a topic, along with words summarizing its frequent themes. See how segments of the words are more distinct in some quotes than in others.
labelTopics(selectedmodel, c(1, 5))
## Topic 1 Top Words:
## Highest Prob: like, say, thing, help, day, kept, wasn’t
## FREX: “’s, today”, question, updat, appoint, back, shout
## Lift: bodi, didn’t—, dose”, shout, stretch, phase, today”
## Score: like, day, “’s, check, thing, today”, updat
## Topic 5 Top Words:
## Highest Prob: exact, said, everyth, need, call, someon, much
## FREX: provid, said, offer, almost, juggl, mention, schedul
## Lift: “’ll, almost, confus, guilti, juggl, mention, offer
## Score: provid, said, offer, schedul, almost, new, need
thoughts1 <- findThoughts(selectedmodel, texts = shortdoc,
n = 2, topics = 1)$docs[[1]]
thoughts5 <- findThoughts(selectedmodel, texts = shortdoc,
n = 2, topics = 5)$docs[[1]]
# pdf("stmVignette-015.pdf")
par(mfrow = c(2, 1), mar = c(.5, .5, 1, .5))
plotQuote(thoughts1, width = 40, maxwidth = 120, main = "Topic 1")
plotQuote(thoughts5, width = 40, maxwidth = 120, main = "Topic 5")
# dev.off()
Adjust the stm by including other variables: covariates from the metadata. This is flexible; adjust for corpus. Look at the proportion of topics and the expected document composition. Note that time variables (dates) require different handling.
meta$excerpt <- as.factor(meta$excerpt)  # note: estimateEffect below is passed out$meta, where excerpt is still numeric; pass meta (or convert out$meta) if the factor coding is intended
prep <- estimateEffect(1:5 ~ excerpt, selectedmodel,
meta = out$meta, uncertainty = "Global")
summary(prep, topics = 1)
##
## Call:
## estimateEffect(formula = 1:5 ~ excerpt, stmobj = selectedmodel,
## metadata = out$meta, uncertainty = "Global")
##
##
## Topic 1:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.134192 0.023678 5.667 5.25e-08 ***
## excerpt 0.006684 0.002650 2.522 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# pdf("stmVignette-017.pdf")
plot(selectedmodel, type = "summary")
# dev.off()
# pdf("stmVignette-019.pdf")
# plot(prep, "day", method = "continuous", topics = 13,
# model = z, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
# monthseq <- seq(from = as.Date("2008-01-01"),
# to = as.Date("2008-12-01"), by = "month")
# monthnames <- months(monthseq)
# axis(1,
# at = as.numeric(monthseq) - min(as.numeric(monthseq)),
# labels = monthnames)
# dev.off()
Other plots include stm word clouds, topic correlations, and convergence. Topics may go through further examination and curation, especially human validation.
# pdf("stmVignette-025.pdf")
cloud(selectedmodel, topic = 5, scale = c(2, .25))
# dev.off()
mod.out.corr <- topicCorr(selectedmodel)
# pdf("stmVignette-027.pdf")
plot(mod.out.corr)
# dev.off()
# pdf("stmVignette-028.pdf")
plot(selectedmodel$convergence$bound, type = "l",
ylab = "Approximate Objective",
main = "Convergence")
# dev.off()
Structural topic models are built to incorporate covariate information and may be used to investigate model validity.
Again using tidytext, examine word-pair co-occurrence within the documents. Consider which measures and network statistics are important to describe. Context from connected words and negations is needed. High-FREX words, rather than stop words, should be the focus of these graphs.
tidy_interview_pairs <- tidy_interview %>%
anti_join(stop_words) %>%
pairwise_count(word, line, sort = TRUE, upper = FALSE)
# network
set.seed(1234)
tidy_interview_pairs %>%
filter(n >= 5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Correlation between pairs of words, running the same steps. Adjust the correlation threshold and statistical tests for the corpus.
# word correlation
tidy_interview_cors <- tidy_interview %>%
group_by(word) %>%
filter(n() >= 3) %>%
pairwise_cor(word, line, sort = TRUE, upper = FALSE)
set.seed(1234)
tidy_interview_cors %>%
filter(correlation > .6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "royalblue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Use the quanteda package to create a feature co-occurrence matrix, then run essentially the same procedure to measure word co-occurrence. Here the interview line is the unit of text. The same analysis can also be run over a window of words instead of the whole unit. Stop words must be removed for a more effective measurement.
word_tokens <- tokens(c(interview_df$text)) %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords("english"), padding = FALSE)
interview_fcm <- fcm(word_tokens, context = "document")
top_feats <- rowSums(interview_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_subset <- fcm_select(interview_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_subset) %>%
textplot_network(min_freq = 0.6)
interview_win_fcm <- fcm(word_tokens, context = "window", window = 5)
top_feats <- rowSums(interview_win_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_win_subset <- fcm_select(interview_win_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_win_subset) %>%
textplot_network(min_freq = 0.6)
Again, measure and track feature co-occurrence, this time using each document as the unit of text. These networks can be improved in their quantitative output and informativeness for the corpus, and the visualization can be improved for clarity. Again, remove stop words.
word_tokens <- tokens(c(int_doc_df$doc_text)) %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords("english"), padding = FALSE)
interview_fcm <- fcm(word_tokens, context = "document")
top_feats <- rowSums(interview_fcm) %>%
sort(decreasing = TRUE) %>%
head(25)
fcm_subset <- fcm_select(interview_fcm, pattern = names(top_feats))
set.seed(2017)
fcm_select(fcm_subset) %>%
textplot_network(min_freq = 0.6)
# quanteda dfm
# quanteda textstat_simil() textstat_dist()
# cosine similarity, other methods distance
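A minimal sketch of the document-similarity idea in the placeholder above, assuming textstat_simil is available from the quanteda.textstats package (quanteda 3 and later moved the textstat_* functions there); interview_dfm and doc_sims are illustrative names.
# document-feature matrix from the doc-level tokens, then cosine similarity between docs
library(quanteda.textstats)
interview_dfm <- dfm(word_tokens)
doc_sims <- textstat_simil(interview_dfm, method = "cosine", margin = "documents")
as.matrix(doc_sims)[1:5, 1:5]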
This code is adapted from the tidytext website (chapters 4 and 8), the stm package paper, Humanities Data in R (chapters 6 and 7), and the quanteda package documentation.
References (APA 7):
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media. https://www.tidytextmining.com/
Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2), 1–40. https://doi.org/10.18637/jss.v091.i02
Arnold, T., & Tilton, L. (2015). Humanities data in R: Exploring networks, geospatial data, images, and text. Springer. https://humanitiesdata.org/
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774