layout: true

<div class="my-footer"><span>Filippo Chiarello, Ph.D.</span></div>

---

class: inverse, center, bottom
background-image: url(figs/lib_pic.jpg)
background-size: cover

# WELCOME!

## [III] DS 4 SCI projects (basics): text analysis (1)

### Text Mining Using Tidy Data Principles

---

# Introducing Myself

<img src="figs/myself.jpg" width="150px"/>

.large[Filippo Chiarello, Ph.D.]

--

- .large[Researcher, University of Pisa]

--

- .large[Co-Founder and Member, B4DS Lab (http://b4ds.unipi.it/)]

--

- .large[Co-Founder and CTO, Texty s.r.l. (http://texty.biz/)]

--

- .large[Research Consultant, Errequadro s.r.l. (https://www.errequadrosrl.com/)]

---

## **About the Lessons**

--

- .large[Everything is designed to make your coding life easier 👾]

--

- .large[There are no dumb questions 🧐]

--

- .large[When you talk, please turn your camera on 🎥]

--

- .large[Let's keep in touch:]
    - By email: send a message to `filippo.chiarello@unipi.it`
    - Via LinkedIn: https://www.linkedin.com/in/filippo-chiarello-2b382770/

---

## Importance of TM and NLP for Companies

--

- .large[When?]

--

- .large[What?]

--

- .large[Where?]

--

- .large[Who?]

--

- .large[WHY?]

---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# HOW??

## TIDY DATA PRINCIPLES + TEXT MINING

---

background-image: url(figs/tidytext_repo.png)
background-size: 800px
background-position: 50% 20%
class: bottom, right

.large[[https://github.com/juliasilge/tidytext](https://github.com/juliasilge/tidytext)]

.large[[https://tidytextmining.com/](https://tidytextmining.com/)]

---

background-image: url(figs/cover.png)
background-size: 450px
background-position: 50% 50%

---

class: middle, center

# <i class="fa fa-github"></i>

# GitHub repo for Text Mining Lessons:

.large[[github.com/FilippoChiarello/text-mining](https://github.com/FilippoChiarello/text-mining)]

---

class: inverse

## Plan for introductory lessons

--

- .large[*Lesson 1*: Text Mining Using Tidy Data Principles (12.10)]

--

- .large[*Lesson 2*: Modeling for text (15.10)]

--

- .large[Log in to RStudio Cloud]

---

class: middle, center

# <i class="fa fa-cloud"></i>

# Go here and log in (free):

.large[[bit.ly/rstudio-text-course](https://rstudio.cloud/spaces/95227/join?access_code=e%2FcgTuSzM5egcw0WiSP3ThXTXjIHYqMaSA6rJb3q)]

---

## Let's install some packages

```r
install.packages(c("tidyverse", "tidytext", "gutenbergr"))
```

---

## **What do we mean by tidy text?**

```r
text <- c("Tell all the truth but tell it slant —",
          "Success in Circuit lies",
          "Too bright for our infirm Delight",
          "The Truth's superb surprise",
          "As Lightning to the Children eased",
          "With explanation kind",
          "The Truth must dazzle gradually",
          "Or every man be blind —")

text
```

```
## [1] "Tell all the truth but tell it slant —"
## [2] "Success in Circuit lies"
## [3] "Too bright for our infirm Delight"
## [4] "The Truth's superb surprise"
## [5] "As Lightning to the Children eased"
## [6] "With explanation kind"
## [7] "The Truth must dazzle gradually"
## [8] "Or every man be blind —"
```

---

## **What do we mean by tidy text?**

```r
library(tidyverse)

text_df <- tibble(line = 1:8, text = text)

text_df
```

```
## # A tibble: 8 x 2
##    line text
##   <int> <chr>
## 1     1 Tell all the truth but tell it slant —
## 2     2 Success in Circuit lies
## 3     3 Too bright for our infirm Delight
## 4     4 The Truth's superb surprise
## 5     5 As Lightning to the Children eased
## 6     6 With explanation kind
## 7     7 The Truth must dazzle gradually
## 8     8 Or every man be blind —
```

---

## **What do we mean by tidy text?**

```r
library(tidytext)

text_df %>%
* unnest_tokens(word, text)
```

```
## # A tibble: 41 x 2
##     line word
##    <int> <chr>
##  1     1 tell
##  2     1 all
##  3     1 the
##  4     1 truth
##  5     1 but
##  6     1 tell
##  7     1 it
##  8     1 slant
##  9     2 success
## 10     2 in
## # … with 31 more rows
```
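---

## **Tokenizing into other units**

.large[`unnest_tokens()` is not limited to single words. A minimal sketch (not part of the original exercises) using the same `text_df`:]

```r
library(tidytext)

# tokenize into sentences instead of single words
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")

# keep the original casing (unnest_tokens() lowercases by default)
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)
```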
---

## Pop Quiz

.large[A tidy text dataset typically has]

- .unscramble[more]
- .unscramble[fewer]

.large[rows than the original, non-tidy text dataset.]

---

## **Gathering more data**

.large[You can access the full text of many public domain works from [Project Gutenberg](https://www.gutenberg.org/) using the [gutenbergr](https://ropensci.org/tutorials/gutenbergr_tutorial.html) package.]

```r
library(gutenbergr)

full_text <- gutenberg_download(6522)
```

<img src="figs/military.jpg" width="250px"/>

---

## **Time to tidy your text!**

```r
tidy_book <- full_text %>%
  mutate(line = row_number()) %>%
  unnest_tokens(word, text)

tidy_book %>%
  sample_n(10) %>%
  glimpse()
```

```
## Rows: 10
## Columns: 3
## $ gutenberg_id <int> 6522, 6522, 6522, 6522, 6522, 6522, 6522, 6522, 6522, 65…
## $ line         <int> 1728, 909, 1712, 921, 1456, 873, 982, 1906, 848, 360
## $ word         <chr> "world", "the", "the", "whiteness", "is", "of", "she", "…
```

---

## Your turn 1

```r
# tidy_book <- full_text %>%
#   mutate(line = row_number()) %>%
#   _____
```

---

## Your turn 2

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  count(word, sort = TRUE)
```

---

## Your turn 2

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 2,106 x 2
##    word      n
##    <chr> <int>
##  1 the     882
##  2 and     358
##  3 of      337
##  4 in      270
##  5 my      250
##  6 to      247
##  7 i       236
##  8 is      152
##  9 you     136
## 10 your    134
## # … with 2,096 more rows
```

---

## **Stop words**

```r
get_stopwords()
```

```
## # A tibble: 175 x 2
##    word      lexicon
##    <chr>     <chr>
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # … with 165 more rows
```

---

## **Stop words**

```r
get_stopwords(language = "es")
```

```
## # A tibble: 308 x 2
##    word  lexicon
##    <chr> <chr>
##  1 de    snowball
##  2 la    snowball
##  3 que   snowball
##  4 el    snowball
##  5 en    snowball
##  6 y     snowball
##  7 a     snowball
##  8 los   snowball
##  9 del   snowball
## 10 se    snowball
## # … with 298 more rows
```

---

## **Stop words**

```r
get_stopwords(language = "pt")
```

```
## # A tibble: 203 x 2
##    word  lexicon
##    <chr> <chr>
##  1 de    snowball
##  2 a     snowball
##  3 o     snowball
##  4 que   snowball
##  5 e     snowball
##  6 do    snowball
##  7 da    snowball
##  8 em    snowball
##  9 um    snowball
## 10 para  snowball
## # … with 193 more rows
```

---

## **Stop words**

```r
get_stopwords(source = "smart")
```

```
## # A tibble: 571 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           smart
##  2 a's         smart
##  3 able        smart
##  4 about       smart
##  5 above       smart
##  6 according   smart
##  7 accordingly smart
##  8 across      smart
##  9 actually    smart
## 10 after       smart
## # … with 561 more rows
```

---

## **What are the most common words?**

```r
tidy_book %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
* ggplot(aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip()
```

---

## Your turn 3

**U N S C R A M B L E**

---

<!-- -->
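---

## **Custom stop words**

.large[Pre-built lexicons rarely cover domain-specific noise. A minimal sketch (the extra words below are hypothetical, not part of the original exercises) that extends a lexicon with `bind_rows()`:]

```r
library(tidyverse)
library(tidytext)

# hypothetical domain-specific stop words for this corpus
my_stopwords <- tibble(word = c("thee", "thou", "thy"),
                       lexicon = "custom")

tidy_book %>%
  anti_join(bind_rows(get_stopwords(), my_stopwords),
            by = "word") %>%
  count(word, sort = TRUE)
```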
---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# SENTIMENT ANALYSIS

# Using lexicons

---

## **Sentiment lexicons**

```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # … with 6,776 more rows
```

---

## **Implementing sentiment analysis**

```r
tidy_book %>%
* inner_join(get_sentiments("bing")) %>%
  count(sentiment, sort = TRUE)
```

```
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative    470
## 2 positive    356
```

---

## Your turn 4

Implement sentiment analysis with an `inner_join()`

```r
# tidy_book %>%
#   ___(get_sentiments("bing")) %>%
#   count(sentiment, sort = TRUE)
```

---

## Your turn 5

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE)
```

---

## **Implementing sentiment analysis**

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
* count(sentiment, word, sort = TRUE)
```

```
## # A tibble: 360 x 3
##    sentiment word       n
##    <chr>     <chr>  <int>
##  1 positive  love      33
##  2 negative  dust      28
##  3 positive  like      28
##  4 negative  dark      24
##  5 positive  joy       17
##  6 negative  death     15
##  7 positive  master    15
##  8 negative  pain      13
##  9 positive  glad      10
## 10 positive  silent    10
## # … with 350 more rows
```

---

## Let's Visualize

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
* ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free")
```

---

class: middle

<!-- -->
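---

## **Sentiment across a text**

.large[Counts over the whole book hide *where* the sentiment happens. A minimal sketch (not part of the original exercises; the 40-line chunk size is an arbitrary assumption) that tracks net sentiment across the narrative:]

```r
library(tidyverse)
library(tidytext)

tidy_book %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  # bin the book into chunks of 40 lines each (assumed chunk size)
  count(index = line %/% 40, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  ggplot(aes(index, net)) +
  geom_col()
```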
---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# WHAT IS A DOCUMENT ABOUT?

# Using tf-idf metric

---

## **What is a document about?**

- .large[Term frequency]
- .large[Inverse document frequency]

`$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

.large[The two are multiplied together:]

`$$tf\text{-}idf(\text{term}) = tf(\text{term}) \times idf(\text{term})$$`

### tf-idf is about comparing **documents** within a **collection**.

---

## **Understanding tf-idf**

.large[Make a collection (*corpus*) for yourself!]

```r
full_collection <- gutenberg_download(c(1272, 5614, 15076, 29185, 29186),
                                      meta_fields = "title")
```

---

## **Understanding tf-idf**

.large[Experiment with a collection (*corpus*) for yourself!]

```r
full_collection %>%
  count(title)
```

```
## # A tibble: 5 x 2
##   title                                                             n
##   <chr>                                                         <int>
## 1 "Chess Strategy"                                              13512
## 2 "National Strategy for Combating Terrorism\nFebruary 2003"     1361
## 3 "National Strategy for Combating Terrorism\nSeptember 2006"    1037
## 4 "Some Principles of Maritime Strategy"                         9966
## 5 "The Riddle of the Rhine: Chemical Strategy in Peace and War"  8413
```

---

## Your turn 7

```r
book_words <- full_collection %>%
  unnest_tokens(word, text) %>%
  count(title, word, sort = TRUE)
```

What do the columns of `book_words` tell us?

---

## **Calculating tf-idf**

```r
book_tfidf <- book_words %>%
  bind_tf_idf(word, title, n)
```

---

## **Calculating tf-idf**

```r
book_tfidf
```

```
## # A tibble: 20,872 x 6
##    title                                         word      n     tf   idf tf_idf
##    <chr>                                         <chr> <int>  <dbl> <dbl>  <dbl>
##  1 Some Principles of Maritime Strategy          the    8051 0.0778 0     0
##  2 The Riddle of the Rhine: Chemical Strategy i… the    6150 0.0802 0     0
##  3 Some Principles of Maritime Strategy          of     4613 0.0446 0     0
##  4 Chess Strategy                                the    4104 0.0549 0     0
##  5 The Riddle of the Rhine: Chemical Strategy i… of     3820 0.0498 0     0
##  6 Chess Strategy                                p      3768 0.0504 0.511 0.0258
##  7 Some Principles of Maritime Strategy          to     3630 0.0351 0     0
##  8 Some Principles of Maritime Strategy          and    2320 0.0224 0     0
##  9 Some Principles of Maritime Strategy          in     2191 0.0212 0     0
## 10 Some Principles of Maritime Strategy          a      1929 0.0186 0     0
## # … with 20,862 more rows
```

---

## Your turn 8

.large[What do you predict will happen if we run the following code?]

```r
book_tfidf %>%
  arrange(-tf_idf)
```

---

## Your turn 8

.large[What do you predict will happen if we run the following code?]

```r
book_tfidf %>%
  arrange(-tf_idf)
```

```
## # A tibble: 20,872 x 6
##    title                                     word        n      tf   idf  tf_idf
##    <chr>                                     <chr>   <int>   <dbl> <dbl>   <dbl>
##  1 "Chess Strategy"                          kt       1479 0.0198  1.61  0.0319
##  2 "Chess Strategy"                          p        3768 0.0504  0.511 0.0258
##  3 "Chess Strategy"                          q         977 0.0131  1.61  0.0211
##  4 "Chess Strategy"                          k        1240 0.0166  0.916 0.0152
##  5 "Chess Strategy"                          r        1155 0.0155  0.916 0.0142
##  6 "Chess Strategy"                          pawn      523 0.00700 1.61  0.0113
##  7 "Chess Strategy"                          b        1528 0.0205  0.511 0.0105
##  8 "Chess Strategy"                          white     805 0.0108  0.916 0.00988
##  9 "The Riddle of the Rhine: Chemical Strat… gas       773 0.0101  0.916 0.00924
## 10 "National Strategy for Combating Terrori… terror…   106 0.00998 0.916 0.00915
## # … with 20,862 more rows
```

---

## **Calculating tf-idf**

```r
book_tfidf %>%
  group_by(title) %>%
  top_n(10) %>%
  ungroup() %>%
* ggplot(aes(fct_reorder(word, tf_idf), tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~title, scales = "free")
```

---

<!-- -->

---

## **N-grams... and beyond!**

```r
tidy_ngram <- full_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

tidy_ngram
```

```
## # A tibble: 10,471 x 2
##    gutenberg_id bigram
##           <int> <chr>
##  1         6522 original html
##  2         6522 html version
##  3         6522 version created
##  4         6522 created at
##  5         6522 at eldritchpress.org
##  6         6522 eldritchpress.org by
##  7         6522 by eric
##  8         6522 eric eldred
##  9         6522 eldred this
## 10         6522 this ebook
## # … with 10,461 more rows
```

---

## **N-grams... and beyond!**

```r
tidy_ngram %>%
  count(bigram, sort = TRUE)
```

```
## # A tibble: 7,465 x 2
##    bigram       n
##    <chr>    <int>
##  1 of the     103
##  2 in the     102
##  3 and the     41
##  4 to the      34
##  5 my heart    29
##  6 in my       27
##  7 my life     26
##  8 with the    25
##  9 of my       24
## 10 on the      23
## # … with 7,455 more rows
```

---

## Your turn 10

.large[Can we remove stop words?]

- .large[Yes!]
- .large[No]

---

## **N-grams... and beyond!**

```r
bigram_counts <- tidy_ngram %>%
* separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```

---

## **N-grams... and beyond!**

```r
bigram_counts
```

```
## # A tibble: 1,063 x 3
##    word1    word2       n
##    <chr>    <chr>   <int>
##  1 lord     buddha      7
##  2 morning  light       5
##  3 thy      trumpet     5
##  4 art      thou        3
##  5 earthen  lamp        3
##  6 heart    timid       3
##  7 lose     heart       3
##  8 city     wall        2
##  9 clan     art         2
## 10 daylight faded       2
## # … with 1,053 more rows
```

---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse

## What can you do with n-grams?

- .large[tf-idf of n-grams]

--

- .large[network analysis (sketched on the next slide)]

--

- .large[negation]
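---

## **Sketch: a bigram network**

.large[A minimal sketch of the network-analysis idea (not part of the original exercises; it assumes the igraph and ggraph packages are installed, and the frequency cutoff is arbitrary):]

```r
library(igraph)
library(ggraph)

# treat each bigram as an edge: word1 -> word2, weighted by n
bigram_graph <- bigram_counts %>%
  filter(n >= 2) %>%   # keep only the more frequent bigrams (assumed cutoff)
  graph_from_data_frame()

set.seed(1234)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```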
---

background-image: url(figs/austen-1.png)
background-size: 750px

---

background-image: url(figs/slider.gif)
background-position: 50% 70%

## **What can you do with n-grams?**

### [She Giggles, He Gallops](https://pudding.cool/2017/08/screen-direction/)

---

background-image: url(figs/change_overall-1.svg)
background-size: contain
background-position: center

---

class: left, middle

# Thanks!

Slides created with [**remark.js**](http://remarkjs.com/) and the R package [**xaringan**](https://github.com/yihui/xaringan)