class: center, middle, title-slide .title[ # Text Mining (Natural Language Processing) ] .subtitle[ ## JSC 370: Data Science II ] .date[ ### February 24, 2025 ] --- <style type="text/css"> .orange {color: #EF8633} </style> <!--Yeah... I have really long code chunks, so I just changed the default size :)--> <style type="text/css"> code.*, .remark-code, pre { font-size:15px; } body{ font-family: Helvetica; font-size: 12pt; } p,h1,h2,h3,h4 { font-family: system-ui } .html-widget { margin: auto; } code.r{ /* Code block */ font-size: 20px; } pre { /* Code block - determines code spacing between lines */ font-size: 20px; } </style> # What is NLP? Natural Language Processing (NLP) is used for <u> qualitative data </u> collected as open-ended or free-form text from a survey, medical provider notes in an electronic medical record (EMR), or a transcript of research participant interviews (Koleck et al., 2019). It is also called 'text mining'. --- # What is NLP used for? - Looking at frequencies of words and phrases in text. - Labeling relationships between words such as subject, object, modification. - Identifying entities in free text, labeling them with types such as person, location, organization. - Coupled with AI, it can predict words (autocomplete). --- # How can we do NLP? - We turn text into numbers. - Then use R and the tidyverse to explore those numbers. <img src="data:image/png;base64,#images/tidytext.png" width="80%" style="display: block; margin: auto;" /> --- # Why tidytext? Works seamlessly with ggplot2, dplyr, and tidyr. **Alternatives:** **R**: quanteda, tm, koRpus **Python**: nltk, spaCy, gensim --- # Alice's Adventures in Wonderland Download the alice dataset from [here](https://github.com/JSC370/jsc370-2022/blob/main/data/text/alice.rds). There are 12 chapters. For `tidytext` to work properly, the text should be in a `tibble` (alice is a `tibble`). ``` r alice <- readRDS("alice.rds") alice ``` ``` ## # A tibble: 3,351 × 3 ## text chapter chapter_name ## <chr> <int> <chr> ## 1 "CHAPTER I." 1 CHAPTER I. ## 2 "Down the Rabbit-Hole" 1 CHAPTER I. ## 3 "" 1 CHAPTER I. ## 4 "" 1 CHAPTER I. ## 5 "Alice was beginning to get very tired of sitting by he… 1 CHAPTER I. ## 6 "bank, and of having nothing to do: once or twice she h… 1 CHAPTER I. ## 7 "the book her sister was reading, but it had no picture… 1 CHAPTER I. ## 8 "conversations in it, “and what is the use of a book,” … 1 CHAPTER I. ## 9 "“without pictures or conversations?”" 1 CHAPTER I. ## 10 "" 1 CHAPTER I. ## # ℹ 3,341 more rows ``` --- # Tokenizing Splitting a sentence, phrase, paragraph, or entire document into smaller units called tokens (i.e. individual words, numbers, or punctuation marks). Tokenization is needed for natural language processing. -- In English: - split by spaces - more advanced algorithms --- # spaCy tokenizer  --- ## Tokenizing with unnest_tokens ``` r alice |> unnest_tokens(token, text) ``` ``` ## # A tibble: 26,687 × 3 ## chapter chapter_name token ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter ## 2 1 CHAPTER I. i ## 3 1 CHAPTER I. down ## 4 1 CHAPTER I. the ## 5 1 CHAPTER I. rabbit ## 6 1 CHAPTER I. hole ## 7 1 CHAPTER I. alice ## 8 1 CHAPTER I. was ## 9 1 CHAPTER I. beginning ## 10 1 CHAPTER I.
to ## # ℹ 26,677 more rows ``` --- ## Tokenizing with spaCy ``` python import pandas as pd import spacy alice_py = r.alice nlp = spacy.load("en_core_web_sm") # Tokenization using spaCy alice_py["tokens"] = alice_py["text"].apply(lambda x: [token.text for token in nlp(x)]) alice_py ``` --- # Words as a unit Now that we have words as the observation unit, we can use the **dplyr** toolbox. --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) ``` ``` ## # A tibble: 26,687 × 3 ## chapter chapter_name token ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter ## 2 1 CHAPTER I. i ## 3 1 CHAPTER I. down ## 4 1 CHAPTER I. the ## 5 1 CHAPTER I. rabbit ## 6 1 CHAPTER I. hole ## 7 1 CHAPTER I. alice ## 8 1 CHAPTER I. was ## 9 1 CHAPTER I. beginning ## 10 1 CHAPTER I. to ## # ℹ 26,677 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(token) ``` ``` ## # A tibble: 2,740 × 2 ## token n ## <chr> <int> ## 1 _alice’s 1 ## 2 _all 1 ## 3 _all_ 1 ## 4 _and 1 ## 5 _are_ 4 ## 6 _at 1 ## 7 _before 1 ## 8 _beg_ 1 ## 9 _began_ 1 ## 10 _best_ 2 ## # ℹ 2,730 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(token, sort = TRUE) ``` ``` ## # A tibble: 2,740 × 2 ## token n ## <chr> <int> ## 1 the 1643 ## 2 and 871 ## 3 to 729 ## 4 a 632 ## 5 she 538 ## 6 it 527 ## 7 of 514 ## 8 said 460 ## 9 i 393 ## 10 alice 386 ## # ℹ 2,730 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> count(chapter, token) ``` ``` ## # A tibble: 7,549 × 3 ## chapter token n ## <int> <chr> <int> ## 1 1 _curtseying_ 1 ## 2 1 _never_ 1 ## 3 1 _not_ 1 ## 4 1 _one_ 1 ## 5 1 _poison_ 1 ## 6 1 _that_ 1 ## 7 1 _through_ 1 ## 8 1 _took 1 ## 9 1 _very_ 4 ## 10 1 _was_ 1 ## # ℹ 7,539 more rows ``` --- # Using dplyr verbs ``` r alice |> unnest_tokens(token, text) |> group_by(chapter) |> count(token) |> top_n(10, n) ``` ``` ## # A tibble: 122 × 3 ## # Groups: chapter [12] ## chapter token n ## <int> <chr> <int> ## 1 1 a 52 ## 2 1 alice 27 ## 3 1 and 65 ## 4 1 i 30 ## 5 1 it 62 ## 6 1 of 43 ## 7 1 she 79 ## 8 1 the 92 ## 9 1 to 75 ## 10 1 was 52 ## # ℹ 112 more rows ``` --- # Using dplyr verbs and ggplot2 ``` r alice |> unnest_tokens(token, text) |> count(token) |> top_n(10, n) |> ggplot(aes(n, fct_reorder(token, n))) + geom_col(fill = "orange") + theme_bw() ``` --- # Using dplyr verbs and ggplot2 <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Stop words A lot of the words don't tell us very much. Words such as "the", "and", "at", and "for" appear a lot in English text but don't add much to the context.
Words such as these are called **stop words**. For more information about differences in stop words and when to remove them, read this chapter: https://smltar.com/stopwords --- # Stop words in tidytext `tidytext` comes with a `data.frame` of stop words. ``` r head(stop_words) table(stop_words$lexicon) ``` ``` ## # A tibble: 6 × 2 ## word lexicon ## <chr> <chr> ## 1 a SMART ## 2 a's SMART ## 3 able SMART ## 4 about SMART ## 5 above SMART ## 6 according SMART ## ## onix SMART snowball ## 404 571 174 ``` --- # Stop words ``` ## [1] "a" "a's" "able" "about" ## [5] "above" "according" "accordingly" "across" ## [9] "actually" "after" "afterwards" "again" ## [13] "against" "ain't" "all" "allow" ## [17] "allows" "almost" "alone" "along" ## [21] "already" "also" "although" "always" ## [25] "am" "among" "amongst" "an" ## [29] "and" "another" "any" "anybody" ## [33] "anyhow" "anyone" "anything" "anyway" ## [37] "anyways" "anywhere" "apart" "appear" ## [41] "appreciate" "appropriate" "are" "aren't" ## [45] "around" "as" "aside" "ask" ## [49] "asking" "associated" "at" "available" ## [53] "away" "awfully" "b" "be" ## [57] "became" "because" "become" "becomes" ## [61] "becoming" "been" "before" "beforehand" ## [65] "behind" "being" "believe" "below" ## [69] "beside" "besides" "best" "better" ## [73] "between" "beyond" "both" "brief" ## [77] "but" "by" "c" "c'mon" ## [81] "c's" "came" "can" "can't" ## [85] "cannot" "cant" "cause" "causes" ## [89] "certain" "certainly" "changes" "clearly" ## [93] "co" "com" "come" "comes" ## [97] "concerning" "consequently" "consider" "considering" ## [101] "contain" "containing" "contains" "corresponding" ## [105] "could" "couldn't" "course" "currently" ## [109] "d" "definitely" "described" "despite" ## [113] "did" "didn't" "different" "do" ## [117] "does" "doesn't" "doing" "don't" ## [121] "done" "down" "downwards" "during" ## [125] "e" "each" "edu" "eg" ## [129] "eight" "either" "else" "elsewhere" ## [133] "enough" "entirely" "especially" "et" ## [137] "etc" "even" "ever" "every" ## [141] "everybody" "everyone" "everything" "everywhere" ## [145] "ex" "exactly" "example" "except" ## [149] "f" "far" "few" "fifth" ## [153] "first" "five" "followed" "following" ## [157] "follows" "for" "former" "formerly" ## [161] "forth" "four" "from" "further" ## [165] "furthermore" "g" "get" "gets" ## [169] "getting" "given" "gives" "go" ## [173] "goes" "going" "gone" "got" ## [177] "gotten" "greetings" "h" "had" ## [181] "hadn't" "happens" "hardly" "has" ## [185] "hasn't" "have" "haven't" "having" ## [189] "he" "he's" "hello" "help" ## [193] "hence" "her" "here" "here's" ## [197] "hereafter" "hereby" "herein" "hereupon" ## [201] "hers" "herself" "hi" "him" ## [205] "himself" "his" "hither" "hopefully" ## [209] "how" "howbeit" "however" "i" ## [213] "i'd" "i'll" "i'm" "i've" ## [217] "ie" "if" "ignored" "immediate" ## [221] "in" "inasmuch" "inc" "indeed" ## [225] "indicate" "indicated" "indicates" "inner" ## [229] "insofar" "instead" "into" "inward" ## [233] "is" "isn't" "it" "it'd" ## [237] "it'll" "it's" "its" "itself" ## [241] "j" "just" "k" "keep" ## [245] "keeps" "kept" "know" "knows" ## [249] "known" "l" "last" "lately" ## [253] "later" "latter" "latterly" "least" ## [257] "less" "lest" "let" "let's" ## [261] "like" "liked" "likely" "little" ## [265] "look" "looking" "looks" "ltd" ## [269] "m" "mainly" "many" "may" ## [273] "maybe" "me" "mean" "meanwhile" ## [277] "merely" "might" "more" "moreover" ## [281] "most" "mostly" "much" "must" ## [285] "my" "myself" "n" "name" ## [289]
"namely" "nd" "near" "nearly" ## [293] "necessary" "need" "needs" "neither" ## [297] "never" "nevertheless" "new" "next" ## [301] "nine" "no" "nobody" "non" ## [305] "none" "noone" "nor" "normally" ## [309] "not" "nothing" "novel" "now" ## [313] "nowhere" "o" "obviously" "of" ## [317] "off" "often" "oh" "ok" ## [321] "okay" "old" "on" "once" ## [325] "one" "ones" "only" "onto" ## [329] "or" "other" "others" "otherwise" ## [333] "ought" "our" "ours" "ourselves" ## [337] "out" "outside" "over" "overall" ## [341] "own" "p" "particular" "particularly" ## [345] "per" "perhaps" "placed" "please" ## [349] "plus" "possible" "presumably" "probably" ## [353] "provides" "q" "que" "quite" ## [357] "qv" "r" "rather" "rd" ## [361] "re" "really" "reasonably" "regarding" ## [365] "regardless" "regards" "relatively" "respectively" ## [369] "right" "s" "said" "same" ## [373] "saw" "say" "saying" "says" ## [377] "second" "secondly" "see" "seeing" ## [381] "seem" "seemed" "seeming" "seems" ## [385] "seen" "self" "selves" "sensible" ## [389] "sent" "serious" "seriously" "seven" ## [393] "several" "shall" "she" "should" ## [397] "shouldn't" "since" "six" "so" ## [401] "some" "somebody" "somehow" "someone" ## [405] "something" "sometime" "sometimes" "somewhat" ## [409] "somewhere" "soon" "sorry" "specified" ## [413] "specify" "specifying" "still" "sub" ## [417] "such" "sup" "sure" "t" ## [421] "t's" "take" "taken" "tell" ## [425] "tends" "th" "than" "thank" ## [429] "thanks" "thanx" "that" "that's" ## [433] "thats" "the" "their" "theirs" ## [437] "them" "themselves" "then" "thence" ## [441] "there" "there's" "thereafter" "thereby" ## [445] "therefore" "therein" "theres" "thereupon" ## [449] "these" "they" "they'd" "they'll" ## [453] "they're" "they've" "think" "third" ## [457] "this" "thorough" "thoroughly" "those" ## [461] "though" "three" "through" "throughout" ## [465] "thru" "thus" "to" "together" ## [469] "too" "took" "toward" "towards" ## [473] "tried" "tries" "truly" "try" ## [477] "trying" "twice" "two" "u" ## [481] "un" "under" "unfortunately" "unless" ## [485] "unlikely" "until" "unto" "up" ## [489] "upon" "us" "use" "used" ## [493] "useful" "uses" "using" "usually" ## [497] "uucp" "v" "value" "various" ## [501] "very" "via" "viz" "vs" ## [505] "w" "want" "wants" "was" ## [509] "wasn't" "way" "we" "we'd" ## [513] "we'll" "we're" "we've" "welcome" ## [517] "well" "went" "were" "weren't" ## [521] "what" "what's" "whatever" "when" ## [525] "whence" "whenever" "where" "where's" ## [529] "whereafter" "whereas" "whereby" "wherein" ## [533] "whereupon" "wherever" "whether" "which" ## [537] "while" "whither" "who" "who's" ## [541] "whoever" "whole" "whom" "whose" ## [545] "why" "will" "willing" "wish" ## [549] "with" "within" "without" "won't" ## [553] "wonder" "would" "would" "wouldn't" ## [557] "x" "y" "yes" "yet" ## [561] "you" "you'd" "you'll" "you're" ## [565] "you've" "your" "yours" "yourself" ## [569] "yourselves" "z" "zero" ``` --- # Removing stopwords We can use an `anti_join()` to remove the tokens that also appear in the `stop_words` data.frame ``` r alice |> unnest_tokens(token, text) |> anti_join(stop_words, by = c("token" = "word")) |> count(token, sort = TRUE) ``` ``` ## # A tibble: 2,314 × 2 ## token n ## <chr> <int> ## 1 alice 386 ## 2 time 71 ## 3 queen 68 ## 4 king 61 ## 5 don’t 60 ## 6 it’s 57 ## 7 i’m 56 ## 8 mock 56 ## 9 turtle 56 ## 10 gryphon 55 ## # ℹ 2,304 more rows ``` --- # Anti-join with same variable name ``` r alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = 
"word") |> count(word, sort = TRUE) ``` ``` ## # A tibble: 2,314 × 2 ## word n ## <chr> <int> ## 1 alice 386 ## 2 time 71 ## 3 queen 68 ## 4 king 61 ## 5 don’t 60 ## 6 it’s 57 ## 7 i’m 56 ## 8 mock 56 ## 9 turtle 56 ## 10 gryphon 55 ## # ℹ 2,304 more rows ``` --- # Stop words removed ``` r alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = "word") |> count(word, sort = TRUE) |> top_n(10, n) |> ggplot(aes(n, fct_reorder(word, n))) + geom_col(fill = "orange") + theme_bw() ``` --- # Stop words removed <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # Customize Stop Words - If the default lists remove too many or too few words, you can customize. - Many times there are words that are nuisances that may not be in a list, or we might want to remove numbers. In this example, there is not much use in words like don't or it's. We can use `dplyr` to remove this or we can customize our stopword list. - Here we remove punctuation, remove numbers, remove contractions, and remove specific words after punctuation is removed. - Example: ``` r custom_stopwords <- c("n't", "'s", "'m", "'ll", "'ve", "'re", "’s", "’m", "’ll", "’ve", "’re", "dont", "im", "its", "doesnt", "didnt", "wasnt", "werent", "havent", "isnt", "arent", "youre", "theyll", "hed", "shell", "whats", "thats") ``` --- # Customize Stop Words We actually find that there is an issue with the apostrophe in the text, so we first convert "’" to "'" and then remove stopwords and filter out custom stopwords. ``` r alice_sw <- alice |> mutate(text = str_replace_all(text, "’", "'")) |> unnest_tokens(word, text, token = "words") |> anti_join(stop_words, by = "word") |> filter(!word %in% custom_stopwords) |> filter(!str_detect(word, "^[0-9]+$")) |> filter(word != "") ``` --- # Customize Stop Words <img src="data:image/png;base64,#nlp-slides_files/figure-html/stopwords6-1.png" style="display: block; margin: auto;" /> --- # Wordcloud - A wordcloud is a visual that shows the most common words larger as the word in a visualization. - It helps to quickly identify common words and themes in text data. - It is used often in media to show trending words in websites or social media. --- # Wordcloud ``` r alice_sw |> count(word, sort = TRUE) |> top_n(40, n) |> wordcloud2(size = 1, color = "random-light", backgroundColor = "lightgray") ```
--- ## Which words appear together? **ngrams** are sequences of n consecutive words; we can count these to see which words appear together. -- - ngrams with n = 1 are called unigrams: "which", "words", "appear", "together" - ngrams with n = 2 are called bigrams: "which words", "words appear", "appear together" - ngrams with n = 3 are called trigrams: "which words appear", "words appear together" --- ## Which words appear together? We can extract bigrams using `unnest_ngrams()` with `n = 2`. ``` r alice |> unnest_ngrams(ngram, text, n = 2) ``` ``` ## # A tibble: 25,170 × 3 ## chapter chapter_name ngram ## <int> <chr> <chr> ## 1 1 CHAPTER I. chapter i ## 2 1 CHAPTER I. down the ## 3 1 CHAPTER I. the rabbit ## 4 1 CHAPTER I. rabbit hole ## 5 1 CHAPTER I. <NA> ## 6 1 CHAPTER I. <NA> ## 7 1 CHAPTER I. alice was ## 8 1 CHAPTER I. was beginning ## 9 1 CHAPTER I. beginning to ## 10 1 CHAPTER I. to get ## # ℹ 25,160 more rows ``` --- ## Bi-grams Tallying up the bi-grams still shows a lot of stop words, but it is able to pick up relationships. ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> count(ngram, sort = TRUE) ``` ``` ## # A tibble: 13,424 × 2 ## ngram n ## <chr> <int> ## 1 <NA> 951 ## 2 said the 206 ## 3 of the 130 ## 4 said alice 112 ## 5 in a 96 ## 6 and the 75 ## 7 in the 75 ## 8 it was 72 ## 9 to the 68 ## 10 the queen 60 ## # ℹ 13,414 more rows ``` --- ## Bi-grams ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) ``` ``` ## # A tibble: 25,170 × 2 ## word1 word2 ## <chr> <chr> ## 1 chapter i ## 2 down the ## 3 the rabbit ## 4 rabbit hole ## 5 <NA> <NA> ## 6 <NA> <NA> ## 7 alice was ## 8 was beginning ## 9 beginning to ## 10 to get ## # ℹ 25,160 more rows ``` --- ## Bi-grams Filter words that are paired with "alice". ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) |> filter(word1 == "alice") ``` ``` ## # A tibble: 336 × 2 ## word1 word2 ## <chr> <chr> ## 1 alice was ## 2 alice think ## 3 alice started ## 4 alice after ## 5 alice had ## 6 alice to ## 7 alice had ## 8 alice had ## 9 alice soon ## 10 alice began ## # ℹ 326 more rows ``` --- ## Bi-grams ``` r alice |> unnest_ngrams(ngram, text, n = 2) |> separate(ngram, into = c("word1", "word2"), sep = " ") |> select(word1, word2) |> filter(word1 == "alice") |> count(word2, sort = TRUE) ``` ``` ## # A tibble: 133 × 2 ## word2 n ## <chr> <int> ## 1 and 18 ## 2 was 17 ## 3 thought 12 ## 4 as 11 ## 5 said 11 ## 6 could 10 ## 7 had 10 ## 8 did 9 ## 9 in 9 ## 10 to 9 ## # ℹ 123 more rows ``` --- ## Bi-grams Filter stop words, remove leftover underscore punctuation, and keep words that are paired with "alice". ``` r alice |> mutate(text = str_replace_all(text, "’", "'")) |> unnest_tokens(bigram, text, token = "ngrams", n = 2) |> separate(bigram, into = c("word1", "word2"), sep = " ") |> mutate(word1 = str_replace_all(word1, "[[:punct:]_]", "")) |> filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, word2 == "alice") |> count(word1, sort = TRUE) |> slice_max(n, n = 10, with_ties = FALSE) |> ggplot(aes(reorder(word1, n), n)) + geom_col(fill = "orange") + coord_flip() + theme_bw() ``` --- ## Bi-grams <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # TF-IDF TF: **Term frequency** gives weight to terms that appear a lot. It is a measure of how important a word may be, based on how frequently it occurs within a document (e.g.
a book chapter). IDF: **Inverse document frequency** decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents (e.g. all chapters in a book). Some words that occur many times in a document may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a sophisticated approach to adjusting term frequency for commonly used words. --- # TF-IDF TF measures how often a word appears in a document. `$$TF = \frac{\text{Number of times the term appears in a document}}{\text{Total number of terms in that document}}$$` --- # TF-IDF IDF measures how rare a word is across all documents. IDF decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. The inverse document frequency for any given term is defined as `$$IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term}}\right)$$` --- # TF-IDF TF-IDF: TF and IDF can be combined (the two quantities multiplied together), which gives the frequency of a term adjusted for how rarely it is used. `$$\text{TF-IDF} = TF \times IDF$$` The idea of TF-IDF is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents. A high TF-IDF means a word is important in a specific document. A low TF-IDF means it is a common word with less importance. --- ## TF-IDF with tidytext - We are finding important words by chapter. - A high TF-IDF score means a word is important in a specific chapter but not common across all chapters. ``` r alice_tfidf <- alice |> unnest_tokens(word, text) |> count(word, chapter) |> bind_tf_idf(word, chapter, n) |> arrange(desc(tf_idf)) top_tfidf <- alice_tfidf |> group_by(chapter) |> slice_max(tf_idf, n = 5) |> ungroup() ``` --- ## Top 5 TF-IDF by Chapter <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- # Sentiment Analysis - Sentiment analysis is the process of extracting opinions and scoring them as, for example, positive, negative, or neutral. - Based on sentiment analysis, you can find out the nature of the opinions expressed in text. - Sentiment analysis is a type of classification where the data are classified into classes such as positive or negative, or happy, sad, angry, etc. --- ## Sentiment Analysis - Positive and negative sentiments from "bing". - The "afinn" sentiment scores range from very negative (-5) to very positive (+5). - The "nrc" sentiments are categorized into anger, fear, joy, and so on.
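To see what these lexicons look like, `get_sentiments()` returns each one as a tibble. A quick sketch (note: the "afinn" and "nrc" lexicons may prompt a one-time download via the `textdata` package):

``` r
# peek at each lexicon: a word column plus a label ("bing", "nrc") or a numeric score ("afinn")
get_sentiments("bing") |> head()
get_sentiments("afinn") |> head()
get_sentiments("nrc") |> count(sentiment)
```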
``` r bing_sentiments <- get_sentiments("bing") top_sentiment_words <- alice |> unnest_tokens(word, text) |> anti_join(stop_words, by = "word") |> inner_join(bing_sentiments, by = "word") |> count(word, sentiment, sort = TRUE) |> group_by(sentiment) |> slice_max(n, n = 5) |> ungroup() ``` --- ## Sentiment Analysis <table class="table table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Word </th> <th style="text-align:left;"> Sentiment </th> <th style="text-align:right;"> Count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> mock </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 56 </td> </tr> <tr> <td style="text-align:left;"> poor </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 27 </td> </tr> <tr> <td style="text-align:left;"> hastily </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> mad </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> anxiously </td> <td style="text-align:left;font-weight: bold;color: red !important;"> negative </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> beautiful </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> majesty </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> glad </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:left;"> bright </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> eagerly </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> ready </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> top </td> <td style="text-align:left;font-weight: bold;color: green !important;"> positive </td> <td style="text-align:right;"> 8 </td> </tr> </tbody> </table> --- # Sentiment Analysis <img src="data:image/png;base64,#nlp-slides_files/figure-html/sentiment3-1.png" style="display: block; margin: auto;" /> --- # Topic Modeling Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for (Silge and Robinson, Text Mining with R). <img src="data:image/png;base64,#images/tidymodels.png" width="50%" style="display: block; margin: auto;" /> --- ## Topic Modeling with `topicmodels` One method for topic modeling is *Latent Dirichlet allocation (LDA)*. It is an unsupervised model that discovers topics in a collection of documents and classifies words into these topics.
For example, a two-topic model of news articles could include "sports" and "politics". Words assigned to sports could include "hockey", "basketball", "football", etc., and words assigned to politics could include "election", "prime minister", "mayor", etc. LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. --- ## Topic Modeling with `topicmodels` We need the `topicmodels` package as well as `tm`. To apply the models, we need to create a document-term matrix. This is a matrix where: - each row represents one document (such as a book or article), - each column represents one term, and - each value (typically) contains the number of appearances of that term in that document. - the casting function (`cast_dtm()`) requires a document identifier: here the full text is treated as a single document, but each chapter could also be treated as its own document. --- ## Document-Term Matrix ``` r library(tm) library(topicmodels) alice_dtm <- alice |> unnest_tokens(word, text) |> mutate(word = str_replace_all(word, "[[:punct:]_]", "")) |> filter(!str_detect(word, "^[0-9]+$")) |> anti_join(stop_words, by = "word") |> filter(!word %in% c("im", "ill", "dont", "ive")) |> mutate(document = "full_text") |> count(document, word) |> cast_dtm(document, word, n) ``` --- ## LDA LDA starts by randomly assigning each word in each document to a topic. The algorithm then iteratively refines topic assignments based on two probabilities: - How frequently a word appears in a topic across all documents. - How frequently topics appear in a document. We select the number of topics (k). The model adjusts topic assignments until a stable distribution emerges, and LDA then gives the top words in each topic. ``` r alice_lda <- LDA(alice_dtm, k = 4, control = list(seed = 1234)) ``` --- ## Visualizing LDA ``` r alice_top_terms <- tidy(alice_lda, matrix = "beta") |> group_by(topic) |> slice_max(beta, n = 10) |> ungroup() |> arrange(topic, -beta) alice_top_terms |> mutate(term = reorder_within(term, beta, topic)) |> ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + facet_wrap(~ topic, scales = "free") + theme_bw() + scale_y_reordered() ``` --- ## Visualizing LDA <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- ## LDA with more topics <img src="data:image/png;base64,#nlp-slides_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- ## More on customizing stop words For the most part, we got rid of our stop words and other nuisance words with filtering. The following shows how to create a new stop word list if you want to append custom words to `stop_words` from `tidytext`. ``` r new_stops <- c("chapter","series_","_the","well", "way","now","illustration", "york", "sons", "company", "1916", "gabriel", "sam'l", "v", "vi", "vii", "viii","xi","x","xii","xii.","10","11","12", "10,","12,","c(1,","12),", "alice", "dinah", "sister","storyland", "series", "copyright", "saml", "alice's", "alices", "said","like", "little", "went", "came", "one","just","i'm","_i_") # need a lexicon column custom <- rep("CUSTOM", length(new_stops)) # create tibble custom_stop_words <- tibble(word = new_stops, lexicon = custom) # Bind the custom stop words to stop_words stop_words2 <- rbind(stop_words, custom_stop_words) ```
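---

## Using the custom stop word list

`stop_words2` can now be used exactly like `stop_words` in the `anti_join()` step. A quick sketch, repeating the earlier counting pipeline with the extended list:

``` r
# tokenize, drop the extended stop word list, and count what remains
alice |>
  unnest_tokens(word, text) |>
  anti_join(stop_words2, by = "word") |>
  count(word, sort = TRUE)
```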