class: center, middle, title-slide .title[ # Text Mining (Natural Language Processing) ] .subtitle[ ## JSC 370: Data Science II ] .date[ ### February 24, 2025 ] --- <style type="text/css"> .orange {color: #EF8633} </style> <!--Yeah... I have really long code chunks, so I just changed the default size :)--> <style type="text/css"> code.*, .remark-code, pre { font-size:15px; } body{ font-family: Helvetica; font-size: 12pt; } p,h1,h2,h3,h4 { font-family: system-ui } .html-widget { margin: auto; } code.r{ /* Code block */ font-size: 20px; } pre { /* Code block - determines code spacing between lines */ font-size: 20px; } </style> # What is NLP? Natural Language Processing (NLP) is used for <u> qualitative data </u> collected as open-ended or free-form text from a survey, medical provider notes in an electronic medical record (EMR), or a transcript of research participant interviews (Koleck et al., 2019). It is also called 'text mining'. --- # What is NLP used for? - Looking at frequencies of words and phrases in text. - Labeling relationships between words such as subject, object, modification. - Identifying entities in free text, labeling them with types such as person, location, organization. - Coupled with AI, it can predict words (autocomplete). --- # How can we do NLP? - We turn text into numbers. - Then use R and the tidyverse to explore those numbers. <img src="data:image/png;base64,#images/tidytext.png" width="80%" style="display: block; margin: auto;" /> --- # Why tidytext? Works seamlessly with ggplot2, dplyr, and tidyr. **Alternatives:** **R**: quanteda, tm, koRpus **Python**: nltk, spaCy, gensim --- # Alice's Adventures in Wonderland Download the alice dataset from [here](https://github.com/JSC370/jsc370-2022/blob/main/data/text/alice.rds). There are 12 chapters. For `tidytext` to work properly, the text should be in a `tibble` (alice is a `tibble`). ``` r alice <- readRDS("alice.rds") alice ``` ``` ## # A tibble: 3,351 × 3 ## text chapter chapter_name ## <chr> <int> <chr> ## 1 "CHAPTER I." 1 CHAPTER I. ## 2 "Down the Rabbit-Hole" 1 CHAPTER I. ## 3 "" 1 CHAPTER I. ## 4 "" 1 CHAPTER I. ## 5 "Alice was beginning to get very tired of sitting by he… 1 CHAPTER I. ## 6 "bank, and of having nothing to do: once or twice she h… 1 CHAPTER I. ## 7 "the book her sister was reading, but it had no picture… 1 CHAPTER I. ## 8 "conversations in it, “and what is the use of a book,” … 1 CHAPTER I. ## 9 "“without pictures or conversations?”" 1 CHAPTER I. ## 10 "" 1 CHAPTER I. ## # ℹ 3,341 more rows ``` --- # Tokenizing Splitting a sentence, phrase, paragraph, or entire document into smaller units called tokens (i.e. individual words, numbers, or punctuation marks). Tokenization is needed for natural language processing. -- In English: - split by spaces - more advanced algorithms --- # spaCy tokenizer  --- ## Tokenizing with unnest_tokens ``` r alice |> unnest_tokens(token, text) ``` ``` ## # A tibble: 26,687 × 3 ## chapter chapter_name token ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter ## 2 1 CHAPTER I. i ## 3 1 CHAPTER I. down ## 4 1 CHAPTER I. the ## 5 1 CHAPTER I. rabbit ## 6 1 CHAPTER I. hole ## 7 1 CHAPTER I. alice ## 8 1 CHAPTER I. was ## 9 1 CHAPTER I. beginning ## 10 1 CHAPTER I.
to ## # ℹ 26,677 more rows ``` --- ## Tokenizing with spaCy ``` python import pandas as pd import spacy alice_py = r.alice nlp = spacy.load("en_core_web_sm") # Tokenization using spaCy alice_py["tokens"] = alice_py["text"].apply(lambda x: [token.text for token in nlp(x)]) alice_py ``` --- # Words as a unit Now that we have words as the observation unit, we can use the **dplyr** toolbox. --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) ``` ``` ## # A tibble: 26,687 × 3 ## chapter chapter_name token ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter ## 2 1 CHAPTER I. i ## 3 1 CHAPTER I. down ## 4 1 CHAPTER I. the ## 5 1 CHAPTER I. rabbit ## 6 1 CHAPTER I. hole ## 7 1 CHAPTER I. alice ## 8 1 CHAPTER I. was ## 9 1 CHAPTER I. beginning ## 10 1 CHAPTER I. to ## # ℹ 26,677 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(token) ``` ``` ## # A tibble: 2,740 × 2 ## token n ## <chr> <int> ## 1 _alice’s 1 ## 2 _all 1 ## 3 _all_ 1 ## 4 _and 1 ## 5 _are_ 4 ## 6 _at 1 ## 7 _before 1 ## 8 _beg_ 1 ## 9 _began_ 1 ## 10 _best_ 2 ## # ℹ 2,730 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(token, sort = TRUE) ``` ``` ## # A tibble: 2,740 × 2 ## token n ## <chr> <int> ## 1 the 1643 ## 2 and 871 ## 3 to 729 ## 4 a 632 ## 5 she 538 ## 6 it 527 ## 7 of 514 ## 8 said 460 ## 9 i 393 ## 10 alice 386 ## # ℹ 2,730 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(chapter, token) ``` ``` ## # A tibble: 7,549 × 3 ## chapter token n ## <int> <chr> <int> ## 1 1 _curtseying_ 1 ## 2 1 _never_ 1 ## 3 1 _not_ 1 ## 4 1 _one_ 1 ## 5 1 _poison_ 1 ## 6 1 _that_ 1 ## 7 1 _through_ 1 ## 8 1 _took 1 ## 9 1 _very_ 4 ## 10 1 _was_ 1 ## # ℹ 7,539 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> group_by(chapter) |> count(token) |> top_n(10, n) ``` ``` ## # A tibble: 122 × 3 ## # Groups: chapter [12] ## chapter token n ## <int> <chr> <int> ## 1 1 a 52 ## 2 1 alice 27 ## 3 1 and 65 ## 4 1 i 30 ## 5 1 it 62 ## 6 1 of 43 ## 7 1 she 79 ## 8 1 the 92 ## 9 1 to 75 ## 10 1 was 52 ## # ℹ 112 more rows ``` --- # Using dplyr verbs and ggplot2 ``` r alice |> unnest_tokens(token, text) |> count(token) |> top_n(10, n) |> ggplot(aes(n, fct_reorder(token, n))) + geom_col(fill = "orange") + theme_bw() ``` --- # Using dplyr verbs and ggplot2 <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Stop words A lot of the words don't tell us very much. Words such as "the", "and", "at", and "for" appear a lot in English text but don't add much to the context.
Words such as these are called **stop words**. For more information about differences in stop words and when to remove them, read this chapter: https://smltar.com/stopwords --- # Stop words in tidytext `tidytext` comes with a `data.frame` of stop words. ``` r head(stop_words) table(stop_words$lexicon) ``` ``` ## # A tibble: 6 × 2 ## word lexicon ## <chr> <chr> ## 1 a SMART ## 2 a's SMART ## 3 able SMART ## 4 about SMART ## 5 above SMART ## 6 according SMART ## ## onix SMART snowball ## 404 571 174 ``` --- # Stop words ``` ## [1] "a" "a's" "able" "about" ## [5] "above" "according" "accordingly" "across" ## [9] "actually" "after" "afterwards" "again" ## [13] "against" "ain't" "all" "allow" ## [17] "allows" "almost" "alone" "along" ## [21] "already" "also" "although" "always" ## [25] "am" "among" "amongst" "an" ## [29] "and" "another" "any" "anybody" ## [33] "anyhow" "anyone" "anything" "anyway" ## [37] "anyways" "anywhere" "apart" "appear" ## [41] "appreciate" "appropriate" "are" "aren't" ## [45] "around" "as" "aside" "ask" ## [49] "asking" "associated" "at" "available" ## [53] "away" "awfully" "b" "be" ## [57] "became" "because" "become" "becomes" ## [61] "becoming" "been" "before" "beforehand" ## [65] "behind" "being" "believe" "below" ## [69] "beside" "besides" "best" "better" ## [73] "between" "beyond" "both" "brief" ## [77] "but" "by" "c" "c'mon" ## [81] "c's" "came" "can" "can't" ## [85] "cannot" "cant" "cause" "causes" ## [89] "certain" "certainly" "changes" "clearly" ## [93] "co" "com" "come" "comes" ## [97] "concerning" "consequently" "consider" "considering" ## [101] "contain" "containing" "contains" "corresponding" ## [105] "could" "couldn't" "course" "currently" ## [109] "d" "definitely" "described" "despite" ## [113] "did" "didn't" "different" "do" ## [117] "does" "doesn't" "doing" "don't" ## [121] "done" "down" "downwards" "during" ## [125] "e" "each" "edu" "eg" ## [129] "eight" "either" "else" "elsewhere" ## [133] "enough" "entirely" "especially" "et" ## [137] "etc" "even" "ever" "every" ## [141] "everybody" "everyone" "everything" "everywhere" ## [145] "ex" "exactly" "example" "except" ## [149] "f" "far" "few" "fifth" ## [153] "first" "five" "followed" "following" ## [157] "follows" "for" "former" "formerly" ## [161] "forth" "four" "from" "further" ## [165] "furthermore" "g" "get" "gets" ## [169] "getting" "given" "gives" "go" ## [173] "goes" "going" "gone" "got" ## [177] "gotten" "greetings" "h" "had" ## [181] "hadn't" "happens" "hardly" "has" ## [185] "hasn't" "have" "haven't" "having" ## [189] "he" "he's" "hello" "help" ## [193] "hence" "her" "here" "here's" ## [197] "hereafter" "hereby" "herein" "hereupon" ## [201] "hers" "herself" "hi" "him" ## [205] "himself" "his" "hither" "hopefully" ## [209] "how" "howbeit" "however" "i" ## [213] "i'd" "i'll" "i'm" "i've" ## [217] "ie" "if" "ignored" "immediate" ## [221] "in" "inasmuch" "inc" "indeed" ## [225] "indicate" "indicated" "indicates" "inner" ## [229] "insofar" "instead" "into" "inward" ## [233] "is" "isn't" "it" "it'd" ## [237] "it'll" "it's" "its" "itself" ## [241] "j" "just" "k" "keep" ## [245] "keeps" "kept" "know" "knows" ## [249] "known" "l" "last" "lately" ## [253] "later" "latter" "latterly" "least" ## [257] "less" "lest" "let" "let's" ## [261] "like" "liked" "likely" "little" ## [265] "look" "looking" "looks" "ltd" ## [269] "m" "mainly" "many" "may" ## [273] "maybe" "me" "mean" "meanwhile" ## [277] "merely" "might" "more" "moreover" ## [281] "most" "mostly" "much" "must" ## [285] "my" "myself" "n" "name" ## [289]
"namely" "nd" "near" "nearly" ## [293] "necessary" "need" "needs" "neither" ## [297] "never" "nevertheless" "new" "next" ## [301] "nine" "no" "nobody" "non" ## [305] "none" "noone" "nor" "normally" ## [309] "not" "nothing" "novel" "now" ## [313] "nowhere" "o" "obviously" "of" ## [317] "off" "often" "oh" "ok" ## [321] "okay" "old" "on" "once" ## [325] "one" "ones" "only" "onto" ## [329] "or" "other" "others" "otherwise" ## [333] "ought" "our" "ours" "ourselves" ## [337] "out" "outside" "over" "overall" ## [341] "own" "p" "particular" "particularly" ## [345] "per" "perhaps" "placed" "please" ## [349] "plus" "possible" "presumably" "probably" ## [353] "provides" "q" "que" "quite" ## [357] "qv" "r" "rather" "rd" ## [361] "re" "really" "reasonably" "regarding" ## [365] "regardless" "regards" "relatively" "respectively" ## [369] "right" "s" "said" "same" ## [373] "saw" "say" "saying" "says" ## [377] "second" "secondly" "see" "seeing" ## [381] "seem" "seemed" "seeming" "seems" ## [385] "seen" "self" "selves" "sensible" ## [389] "sent" "serious" "seriously" "seven" ## [393] "several" "shall" "she" "should" ## [397] "shouldn't" "since" "six" "so" ## [401] "some" "somebody" "somehow" "someone" ## [405] "something" "sometime" "sometimes" "somewhat" ## [409] "somewhere" "soon" "sorry" "specified" ## [413] "specify" "specifying" "still" "sub" ## [417] "such" "sup" "sure" "t" ## [421] "t's" "take" "taken" "tell" ## [425] "tends" "th" "than" "thank" ## [429] "thanks" "thanx" "that" "that's" ## [433] "thats" "the" "their" "theirs" ## [437] "them" "themselves" "then" "thence" ## [441] "there" "there's" "thereafter" "thereby" ## [445] "therefore" "therein" "theres" "thereupon" ## [449] "these" "they" "they'd" "they'll" ## [453] "they're" "they've" "think" "third" ## [457] "this" "thorough" "thoroughly" "those" ## [461] "though" "three" "through" "throughout" ## [465] "thru" "thus" "to" "together" ## [469] "too" "took" "toward" "towards" ## [473] "tried" "tries" "truly" "try" ## [477] "trying" "twice" "two" "u" ## [481] "un" "under" "unfortunately" "unless" ## [485] "unlikely" "until" "unto" "up" ## [489] "upon" "us" "use" "used" ## [493] "useful" "uses" "using" "usually" ## [497] "uucp" "v" "value" "various" ## [501] "very" "via" "viz" "vs" ## [505] "w" "want" "wants" "was" ## [509] "wasn't" "way" "we" "we'd" ## [513] "we'll" "we're" "we've" "welcome" ## [517] "well" "went" "were" "weren't" ## [521] "what" "what's" "whatever" "when" ## [525] "whence" "whenever" "where" "where's" ## [529] "whereafter" "whereas" "whereby" "wherein" ## [533] "whereupon" "wherever" "whether" "which" ## [537] "while" "whither" "who" "who's" ## [541] "whoever" "whole" "whom" "whose" ## [545] "why" "will" "willing" "wish" ## [549] "with" "within" "without" "won't" ## [553] "wonder" "would" "would" "wouldn't" ## [557] "x" "y" "yes" "yet" ## [561] "you" "you'd" "you'll" "you're" ## [565] "you've" "your" "yours" "yourself" ## [569] "yourselves" "z" "zero" ``` --- # Removing stopwords We can use an `anti_join()` to remove the tokens that also appear in the `stop_words` data.frame ``` r alice |> unnest_tokens(token, text) |> anti_join(stop_words, by = c("token" = "word")) |> count(token, sort = TRUE) ``` ``` ## # A tibble: 2,314 × 2 ## token n ## <chr> <int> ## 1 alice 386 ## 2 time 71 ## 3 queen 68 ## 4 king 61 ## 5 don’t 60 ## 6 it’s 57 ## 7 i’m 56 ## 8 mock 56 ## 9 turtle 56 ## 10 gryphon 55 ## # ℹ 2,304 more rows ``` --- # Anti-join with same variable name ``` r alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = 
"word") |> count(word, sort = TRUE) ``` ``` ## # A tibble: 2,314 × 2 ## word n ## <chr> <int> ## 1 alice 386 ## 2 time 71 ## 3 queen 68 ## 4 king 61 ## 5 don’t 60 ## 6 it’s 57 ## 7 i’m 56 ## 8 mock 56 ## 9 turtle 56 ## 10 gryphon 55 ## # ℹ 2,304 more rows ``` --- # Stop words removed ``` r alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = "word") |> count(word, sort = TRUE) |> top_n(10, n) |> ggplot(aes(n, fct_reorder(word, n))) + geom_col(fill = "orange") + theme_bw() ``` --- # Stop words removed <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # Customize Stop Words - If the default lists remove too many or too few words, you can customize. - Many times there are words that are nuisances that may not be in a list, or we might want to remove numbers. In this example, there is not much use in words like don't or it's. We can use `dplyr` to remove this or we can customize our stopword list. - Here we remove punctuation, remove numbers, remove contractions, and remove specific words after punctuation is removed. - Example: ``` r custom_stopwords <- c("n't", "'s", "'m", "'ll", "'ve", "'re", "’s", "’m", "’ll", "’ve", "’re", "dont", "im", "its", "doesnt", "didnt", "wasnt", "werent", "havent", "isnt", "arent", "youre", "theyll", "hed", "shell", "whats", "thats") ``` --- # Customize Stop Words We actually find that there is an issue with the apostrophe in the text, so we first convert "’" to "'" and then remove stopwords and filter out custom stopwords. ``` r alice_sw <- alice |> mutate(text = str_replace_all(text, "’", "'")) |> unnest_tokens(word, text, token = "words") |> anti_join(stop_words, by = "word") |> filter(!word %in% custom_stopwords) |> filter(!str_detect(word, "^[0-9]+$")) |> filter(word != "") ``` --- # Customize Stop Words <img src="data:image/png;base64,#nlp-slides_files/figure-html/stopwords6-1.png" style="display: block; margin: auto;" /> --- # Wordcloud - A wordcloud is a visual that shows the most common words larger as the word in a visualization. - It helps to quickly identify common words and themes in text data. - It is used often in media to show trending words in websites or social media. --- # Wordcloud ``` r alice_sw |> count(word, sort = TRUE) |> top_n(40, n) |> wordcloud2(size = 1, color = "random-light", backgroundColor = "lightgray") ```
--- ## Which words appear together? **ngrams** are sequences of n consecutive words; we can count these to see which words appear together. -- - ngrams with n = 1 are called unigrams: "which", "words", "appear", "together" - ngrams with n = 2 are called bigrams: "which words", "words appear", "appear together" - ngrams with n = 3 are called trigrams: "which words appear", "words appear together" --- ## Which words appear together? We can extract bigrams using `unnest_ngrams()` with `n = 2`. ``` r alice |> unnest_ngrams(ngram, text, n = 2) ``` ``` ## # A tibble: 25,170 × 3 ## chapter chapter_name ngram ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter i ## 2 1 CHAPTER I. down the ## 3 1 CHAPTER I. the rabbit ## 4 1 CHAPTER I. rabbit hole ## 5 1 CHAPTER I. <NA> ## 6 1 CHAPTER I. <NA> ## 7 1 CHAPTER I. alice was ## 8 1 CHAPTER I. was beginning ## 9 1 CHAPTER I. beginning to ## 10 1 CHAPTER I. to get ## # ℹ 25,160 more rows ``` --- ## Bi-grams Tallying up the bi-grams still shows a lot of stop words, but it is able to pick up relationships. ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> count(ngram, sort = TRUE) ``` ``` ## # A tibble: 13,424 × 2 ## ngram n ## <chr> <int> ## 1 <NA> 951 ## 2 said the 206 ## 3 of the 130 ## 4 said alice 112 ## 5 in a 96 ## 6 and the 75 ## 7 in the 75 ## 8 it was 72 ## 9 to the 68 ## 10 the queen 60 ## # ℹ 13,414 more rows ``` --- ## Bi-grams ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) ``` ``` ## # A tibble: 25,170 × 2 ## word1 word2 ## <chr> <chr> ## 1 chapter i ## 2 down the ## 3 the rabbit ## 4 rabbit hole ## 5 <NA> <NA> ## 6 <NA> <NA> ## 7 alice was ## 8 was beginning ## 9 beginning to ## 10 to get ## # ℹ 25,160 more rows ``` --- ## Bi-grams Filter words that are paired with "alice". ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) |> filter(word1 == "alice") ``` ``` ## # A tibble: 336 × 2 ## word1 word2 ## <chr> <chr> ## 1 alice was ## 2 alice think ## 3 alice started ## 4 alice after ## 5 alice had ## 6 alice to ## 7 alice had ## 8 alice had ## 9 alice soon ## 10 alice began ## # ℹ 326 more rows ``` --- ## Bi-grams ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) |> filter(word1 == "alice") |> count(word2, sort = TRUE) ``` ``` ## # A tibble: 133 × 2 ## word2 n ## <chr> <int> ## 1 and 18 ## 2 was 17 ## 3 thought 12 ## 4 as 11 ## 5 said 11 ## 6 could 10 ## 7 had 10 ## 8 did 9 ## 9 in 9 ## 10 to 9 ## # ℹ 123 more rows ``` --- ## Bi-grams Filter stop words, remove leftover underscore punctuation, and keep words that are paired with "alice". ``` r alice |> mutate(text = str_replace_all(text, "’", "'")) |> unnest_tokens(bigram, text, token = "ngrams", n = 2) |> separate(bigram, into = c("word1", "word2"), sep = " ") |> mutate(word1 = str_replace_all(word1, "[[:punct:]_]", "")) |> filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, word2 == "alice") |> count(word1, sort = TRUE) |> slice_max(n, n = 10, with_ties = FALSE) |> ggplot(aes(reorder(word1, n), n)) + geom_col(fill = "orange") + coord_flip() + theme_bw() ``` --- ## Bi-grams <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # TF-IDF TF: **Term frequency** gives weight to terms that appear a lot. It is a measure of how important a word may be, based on how frequently it occurs within a document (e.g.
a book chapter). IDF: **Inverse document frequency** decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents (e.g. all chapters in a book). Some words that occur many times in a document may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a sophisticated approach to adjusting term frequency for commonly used words. --- # TF-IDF TF measures how often a word appears in a document. `$$TF = \frac{\text{Number of times the term appears in a document}}{\text{Total number of terms in that document}}$$` --- # TF-IDF IDF measures how rare a word is across all documents. IDF decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. The inverse document frequency for any given term is defined as `$$IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term}}\right)$$` --- # TF-IDF TF-IDF: TF and IDF can be combined (the two quantities multiplied together), which gives the frequency of a term adjusted for how rarely it is used. `$$\text{TF-IDF} = TF \times IDF$$` The idea of TF-IDF is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents. A high TF-IDF means a word is important in a specific document. A low TF-IDF means it is a common word with less importance. --- ## TF-IDF with tidytext - We are finding important words by chapter. - A high TF-IDF score means a word is important in a specific chapter but not common across all chapters. ``` r alice_tfidf <- alice |> unnest_tokens(word, text) |> count(word, chapter) |> bind_tf_idf(word, chapter, n) |> arrange(desc(tf_idf)) top_tfidf <- alice_tfidf |> group_by(chapter) |> slice_max(tf_idf, n = 5) |> ungroup() ``` --- ## Top 5 TF-IDF by Chapter <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- # Sentiment Analysis - Sentiment analysis is the process of extracting opinions and scoring them as, for example, positive, negative, or neutral. - Based on sentiment analysis, you can find out the nature of the opinions expressed in text. - Sentiment analysis is a type of classification where the data are classified into classes such as positive or negative, or happy, sad, angry, etc. --- ## Sentiment Analysis - Positive and negative sentiments from "bing". - The "afinn" sentiment scores range from very negative (-5) to very positive (+5). - The "nrc" sentiments are categorized into anger, fear, joy, and so on.
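To see what these lexicons look like, `get_sentiments()` returns each one as a tibble. A quick sketch (note: the "afinn" and "nrc" lexicons may prompt a one-time download via the `textdata` package):

``` r
# peek at each lexicon: a word column plus a label ("bing", "nrc") or a numeric score ("afinn")
get_sentiments("bing") |> head()
get_sentiments("afinn") |> head()
get_sentiments("nrc") |> count(sentiment)
```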
``` r bing_sentiments <- get_sentiments("bing") top_sentiment_words <- alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = "word") |> inner_join(bing_sentiments, by = "word") |> count(word, sentiment, sort = TRUE) |> group_by(sentiment) |> slice_max(n, n = 5) |> ungroup() ``` --- ## Sentiment Analysis <table class="table table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Word </th> <th style="text-align:left;"> Sentiment </th> <th style="text-align:right;"> Count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> mock </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 56 </td> </tr> <tr> <td style="text-align:left;"> poor </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 27 </td> </tr> <tr> <td style="text-align:left;"> hastily </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> mad </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> anxiously </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> beautiful </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> majesty </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> glad </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> bright </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> eagerly </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> ready </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> top </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> --- # Sentiment Analysis <img src="data:image/png;base64,#nlp-slides_files/figure-html/sentiment3-1.png" style="display: block; margin: auto;" /> --- # Topic Modeling Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for (Silge and Robinson, Text Mining with R). <img src="data:image/png;base64,#images/tidymodels.png" width="50%" style="display: block; margin: auto;" /> --- ## Topic Modeling with `topicmodels` One method for topic modeling is *Latent Dirichlet allocation (LDA)*. It is an unsupervised model that discovers topics in a collection of documents and classifies words into these topics.
For example, a two-topic model of news articles could include "sports" and "politics". Words assigned to sports could include "hockey", "basketball", "football", etc., and words assigned to politics could include "election", "prime minister", "mayor", etc. LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. --- ## Topic Modeling with `topicmodels` We need the `topicmodels` package as well as `tm`. To apply the models, we need to create a document-term matrix. This is a matrix where: - each row represents one document (such as a book or article), - each column represents one term, and - each value (typically) contains the number of appearances of that term in that document. - the casting function (`cast_dtm()`) requires a document identifier: here the full text is treated as a single document, but each chapter could also be treated as its own document. --- ## Document-Term Matrix ``` r library(tm) library(topicmodels) alice_dtm <- alice |> unnest_tokens(word, text) |> mutate(word = str_replace_all(word, "[[:punct:]_]", "")) |> filter(!str_detect(word, "^[0-9]+$")) |> anti_join(stop_words, by = "word") |> filter(!word %in% c("im", "ill", "dont", "ive")) |> mutate(document = "full_text") |> count(document, word) |> cast_dtm(document, word, n) ``` --- ## LDA LDA starts by randomly assigning each word in each document to a topic. The algorithm then iteratively refines topic assignments based on two probabilities: - How frequently a word appears in a topic across all documents. - How frequently topics appear in a document. We select the number of topics (k). The model adjusts topic assignments until a stable distribution emerges, and LDA then gives the top words in each topic. ``` r alice_lda <- LDA(alice_dtm, k = 4, control = list(seed = 1234)) ``` --- ## Visualizing LDA ``` r alice_top_terms <- tidy(alice_lda, matrix = "beta") |> group_by(topic) |> slice_max(beta, n = 10) |> ungroup() |> arrange(topic, -beta) alice_top_terms |> mutate(term = reorder_within(term, beta, topic)) |> ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + facet_wrap(~ topic, scales = "free") + theme_bw() + scale_y_reordered() ``` --- ## Visualizing LDA <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- ## LDA with more topics <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- ## More on customizing stop words For the most part, we got rid of our stop words and other nuisance words with filtering. The following shows how to create a new stop word list if you want to append custom words to `stop_words` from `tidytext`. ``` r new_stops <- c("chapter","series_","_the","well", "way","now","illustration", "york", "sons", "company", "1916", "gabriel", "sam'l", "v", "vi", "vii", "viii","xi","x","xii","xii.","10","11","12", "10,","12,","c(1,","12),", "alice", "dinah", "sister","storyland", "series", "copyright", "saml", "alice's", "alices", "said","like", "little", "went", "came", "one","just","i'm","_i_") # need a lexicon column custom <- rep("CUSTOM", length(new_stops)) # create tibble custom_stop_words <- tibble(word = new_stops, lexicon = custom) # Bind the custom stop words to stop_words stop_words2 <- rbind(stop_words, custom_stop_words) ```
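---

## Using the custom stop word list

`stop_words2` can now be used exactly like `stop_words` in the `anti_join()` step. A quick sketch, repeating the earlier counting pipeline with the extended list:

``` r
# tokenize, drop the extended stop word list, and count what remains
alice |>
  unnest_tokens(word, text) |>
  anti_join(stop_words2, by = "word") |>
  count(word, sort = TRUE)
```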