layout: true

<div class="my-footer"><span>Filippo Chiarello, Ph.D.</span></div>

---

class: inverse, center, bottom
background-image: url(figs/lib_pic.jpg)
background-size: cover

# WELCOME!

## [III] DS 4 SCI projects (basics): text analysis (1)

### Text Mining Using Tidy Data Principles

---

# Introducing Myself

<img src="figs/myself.jpg" width="150px"/>

.large[Filippo Chiarello, Ph.D.]

--

- .large[Researcher, University of Pisa]

--

- .large[Co-Founder and Member, B4DS Lab (http://b4ds.unipi.it/)]

--

- .large[Co-Founder and CTO, Texty s.r.l. (http://texty.biz/)]

--

- .large[Research Consultant, Errequadro s.r.l. (https://www.errequadrosrl.com/)]

---

## **About the Lessons**

--

- .large[Everything is designed to make your coding life easier 👾]

--

- .large[There are no dumb questions 🧐]

--

- .large[When you talk, please turn your camera on 🎥]

--

- .large[Let's keep in touch:]
    - By email: send a message to `filippo.chiarello@unipi.it`
    - Via LinkedIn: https://www.linkedin.com/in/filippo-chiarello-2b382770/

---

## Importance of TM and NLP for Companies

--

- .large[When?]

--

- .large[What?]

--

- .large[Where?]

--

- .large[Who?]

--

- .large[WHY?]

---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# HOW??

## TIDY DATA PRINCIPLES + TEXT MINING

---

background-image: url(figs/tidytext_repo.png)
background-size: 800px
background-position: 50% 20%
class: bottom, right

.large[[https://github.com/juliasilge/tidytext](https://github.com/juliasilge/tidytext)]

.large[[https://tidytextmining.com/](https://tidytextmining.com/)]

---

background-image: url(figs/cover.png)
background-size: 450px
background-position: 50% 50%

---

class: middle, center

# <i class="fa fa-github"></i>

# GitHub repo for Text Mining Lessons:

.large[[github.com/FilippoChiarello/text-mining](https://github.com/FilippoChiarello/text-mining)]

---

class: inverse

## Plan for introductory lessons

--

- .large[*Lesson 1*: Text Mining Using Tidy Data Principles (12.10)]

--

- .large[*Lesson 2*: Modeling for text (15.10)]

--

- .large[Log in to RStudio Cloud]

---

class: middle, center

# <i class="fa fa-cloud"></i>

# Go here and log in (free):

.large[[bit.ly/rstudio-text-course](https://rstudio.cloud/spaces/95227/join?access_code=e%2FcgTuSzM5egcw0WiSP3ThXTXjIHYqMaSA6rJb3q)]

---

## Let's install some packages

```r
install.packages(c("tidyverse", "tidytext", "gutenbergr"))
```

---

## **What do we mean by tidy text?**

```r
text <- c("Tell all the truth but tell it slant —",
          "Success in Circuit lies",
          "Too bright for our infirm Delight",
          "The Truth's superb surprise",
          "As Lightning to the Children eased",
          "With explanation kind",
          "The Truth must dazzle gradually",
          "Or every man be blind —")

text
```

```
## [1] "Tell all the truth but tell it slant —"
## [2] "Success in Circuit lies"
## [3] "Too bright for our infirm Delight"
## [4] "The Truth's superb surprise"
## [5] "As Lightning to the Children eased"
## [6] "With explanation kind"
## [7] "The Truth must dazzle gradually"
## [8] "Or every man be blind —"
```

---

## **What do we mean by tidy text?**

```r
library(tidyverse)

text_df <- tibble(line = 1:8, text = text)

text_df
```

```
## # A tibble: 8 x 2
##    line text
##   <int> <chr>
## 1     1 Tell all the truth but tell it slant —
## 2     2 Success in Circuit lies
## 3     3 Too bright for our infirm Delight
## 4     4 The Truth's superb surprise
## 5     5 As Lightning to the Children eased
## 6     6 With explanation kind
## 7     7 The Truth must dazzle gradually
## 8     8 Or every man be blind —
```

---

## **What do we mean by tidy text?**

```r
library(tidytext)

text_df %>%
* unnest_tokens(word, text)
```

```
## # A tibble: 41 x 2
##     line word
##    <int> <chr>
##  1     1 tell
##  2     1 all
##  3     1 the
##  4     1 truth
##  5     1 but
##  6     1 tell
##  7     1 it
##  8     1 slant
##  9     2 success
## 10     2 in
## # … with 31 more rows
```
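---

## **Tokenizing into other units**

.large[`unnest_tokens()` is not limited to single words. A minimal sketch (not part of the original exercises) using the same `text_df`:]

```r
library(tidytext)

# tokenize into sentences instead of single words
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")

# keep the original casing (unnest_tokens() lowercases by default)
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)
```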
---

## Pop Quiz

.large[A tidy text dataset typically has]

- .unscramble[more]
- .unscramble[fewer]

.large[rows than the original, non-tidy text dataset.]

---

## **Gathering more data**

.large[You can access the full text of many public domain works from [Project Gutenberg](https://www.gutenberg.org/) using the [gutenbergr](https://ropensci.org/tutorials/gutenbergr_tutorial.html) package.]

```r
library(gutenbergr)

full_text <- gutenberg_download(6522)
```

<img src="figs/military.jpg" width="250px"/>

---

## **Time to tidy your text!**

```r
tidy_book <- full_text %>%
  mutate(line = row_number()) %>%
  unnest_tokens(word, text)

tidy_book %>%
  sample_n(10) %>%
  glimpse()
```

```
## Rows: 10
## Columns: 3
## $ gutenberg_id <int> 6522, 6522, 6522, 6522, 6522, 6522, 6522, 6522, 6522, 65…
## $ line         <int> 1728, 909, 1712, 921, 1456, 873, 982, 1906, 848, 360
## $ word         <chr> "world", "the", "the", "whiteness", "is", "of", "she", "…
```

---

## Your turn 1

```r
# tidy_book <- full_text %>%
#   mutate(line = row_number()) %>%
#   _____
```

---

## Your turn 2

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  count(word, sort = TRUE)
```

---

## Your turn 2

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 2,106 x 2
##    word      n
##    <chr> <int>
##  1 the     882
##  2 and     358
##  3 of      337
##  4 in      270
##  5 my      250
##  6 to      247
##  7 i       236
##  8 is      152
##  9 you     136
## 10 your    134
## # … with 2,096 more rows
```

---

## **Stop words**

```r
get_stopwords()
```

```
## # A tibble: 175 x 2
##    word      lexicon
##    <chr>     <chr>
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # … with 165 more rows
```

---

## **Stop words**

```r
get_stopwords(language = "es")
```

```
## # A tibble: 308 x 2
##    word  lexicon
##    <chr> <chr>
##  1 de    snowball
##  2 la    snowball
##  3 que   snowball
##  4 el    snowball
##  5 en    snowball
##  6 y     snowball
##  7 a     snowball
##  8 los   snowball
##  9 del   snowball
## 10 se    snowball
## # … with 298 more rows
```

---

## **Stop words**

```r
get_stopwords(language = "pt")
```

```
## # A tibble: 203 x 2
##    word  lexicon
##    <chr> <chr>
##  1 de    snowball
##  2 a     snowball
##  3 o     snowball
##  4 que   snowball
##  5 e     snowball
##  6 do    snowball
##  7 da    snowball
##  8 em    snowball
##  9 um    snowball
## 10 para  snowball
## # … with 193 more rows
```

---

## **Stop words**

```r
get_stopwords(source = "smart")
```

```
## # A tibble: 571 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           smart
##  2 a's         smart
##  3 able        smart
##  4 about       smart
##  5 above       smart
##  6 according   smart
##  7 accordingly smart
##  8 across      smart
##  9 actually    smart
## 10 after       smart
## # … with 561 more rows
```

---

## **What are the most common words?**

```r
tidy_book %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
* ggplot(aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip()
```

---

## Your turn 3

**U N S C R A M B L E**

---

<!-- -->
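---

## **Custom stop words**

.large[Pre-built lexicons rarely cover domain-specific noise. A minimal sketch (the extra words below are hypothetical, not part of the original exercises) that extends a lexicon with `bind_rows()`:]

```r
library(tidyverse)
library(tidytext)

# hypothetical domain-specific stop words for this corpus
my_stopwords <- tibble(word = c("thee", "thou", "thy"),
                       lexicon = "custom")

tidy_book %>%
  anti_join(bind_rows(get_stopwords(), my_stopwords),
            by = "word") %>%
  count(word, sort = TRUE)
```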
---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# SENTIMENT ANALYSIS

# Using lexicons

---

## **Sentiment lexicons**

```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # … with 6,776 more rows
```

---

## **Implementing sentiment analysis**

```r
tidy_book %>%
* inner_join(get_sentiments("bing")) %>%
  count(sentiment, sort = TRUE)
```

```
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative    470
## 2 positive    356
```

---

## Your turn 4

Implement sentiment analysis with an `inner_join()`

```r
# tidy_book %>%
#   ___(get_sentiments("bing")) %>%
#   count(sentiment, sort = TRUE)
```

---

## Your turn 5

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE)
```

---

## **Implementing sentiment analysis**

.large[What do you predict will happen if we run the following code?]

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
* count(sentiment, word, sort = TRUE)
```

```
## # A tibble: 360 x 3
##    sentiment word       n
##    <chr>     <chr>  <int>
##  1 positive  love      33
##  2 negative  dust      28
##  3 positive  like      28
##  4 negative  dark      24
##  5 positive  joy       17
##  6 negative  death     15
##  7 positive  master    15
##  8 negative  pain      13
##  9 positive  glad      10
## 10 positive  silent    10
## # … with 350 more rows
```

---

## Let's Visualize

```r
tidy_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
* ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free")
```

---

class: middle

<!-- -->
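---

## **Sentiment across a text**

.large[Counts over the whole book hide *where* the sentiment happens. A minimal sketch (not part of the original exercises; the 40-line chunk size is an arbitrary assumption) that tracks net sentiment across the narrative:]

```r
library(tidyverse)
library(tidytext)

tidy_book %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  # bin the book into chunks of 40 lines each (assumed chunk size)
  count(index = line %/% 40, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  ggplot(aes(index, net)) +
  geom_col()
```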
---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse, center, middle

# WHAT IS A DOCUMENT ABOUT?

# Using tf-idf metric

---

## **What is a document about?**

- .large[Term frequency]
- .large[Inverse document frequency]

`$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

.large[The two are multiplied together:]

`$$tf\text{-}idf(\text{term}) = tf(\text{term}) \times idf(\text{term})$$`

### tf-idf is about comparing **documents** within a **collection**.

---

## **Understanding tf-idf**

.large[Make a collection (*corpus*) for yourself!]

```r
full_collection <- gutenberg_download(c(1272, 5614, 15076, 29185, 29186),
                                      meta_fields = "title")
```

---

## **Understanding tf-idf**

.large[Experiment with a collection (*corpus*) for yourself!]

```r
full_collection %>%
  count(title)
```

```
## # A tibble: 5 x 2
##   title                                                             n
##   <chr>                                                         <int>
## 1 "Chess Strategy"                                              13512
## 2 "National Strategy for Combating Terrorism\nFebruary 2003"     1361
## 3 "National Strategy for Combating Terrorism\nSeptember 2006"    1037
## 4 "Some Principles of Maritime Strategy"                         9966
## 5 "The Riddle of the Rhine: Chemical Strategy in Peace and War"  8413
```

---

## Your turn 7

```r
book_words <- full_collection %>%
  unnest_tokens(word, text) %>%
  count(title, word, sort = TRUE)
```

What do the columns of `book_words` tell us?

---

## **Calculating tf-idf**

```r
book_tfidf <- book_words %>%
  bind_tf_idf(word, title, n)
```

---

## **Calculating tf-idf**

```r
book_tfidf
```

```
## # A tibble: 20,872 x 6
##    title                                         word      n     tf   idf tf_idf
##    <chr>                                         <chr> <int>  <dbl> <dbl>  <dbl>
##  1 Some Principles of Maritime Strategy          the    8051 0.0778 0     0
##  2 The Riddle of the Rhine: Chemical Strategy i… the    6150 0.0802 0     0
##  3 Some Principles of Maritime Strategy          of     4613 0.0446 0     0
##  4 Chess Strategy                                the    4104 0.0549 0     0
##  5 The Riddle of the Rhine: Chemical Strategy i… of     3820 0.0498 0     0
##  6 Chess Strategy                                p      3768 0.0504 0.511 0.0258
##  7 Some Principles of Maritime Strategy          to     3630 0.0351 0     0
##  8 Some Principles of Maritime Strategy          and    2320 0.0224 0     0
##  9 Some Principles of Maritime Strategy          in     2191 0.0212 0     0
## 10 Some Principles of Maritime Strategy          a      1929 0.0186 0     0
## # … with 20,862 more rows
```

---

## Your turn 8

.large[What do you predict will happen if we run the following code?]

```r
book_tfidf %>%
  arrange(-tf_idf)
```

---

## Your turn 8

.large[What do you predict will happen if we run the following code?]

```r
book_tfidf %>%
  arrange(-tf_idf)
```

```
## # A tibble: 20,872 x 6
##    title                                     word        n      tf   idf  tf_idf
##    <chr>                                     <chr>   <int>   <dbl> <dbl>   <dbl>
##  1 "Chess Strategy"                          kt       1479 0.0198  1.61  0.0319
##  2 "Chess Strategy"                          p        3768 0.0504  0.511 0.0258
##  3 "Chess Strategy"                          q         977 0.0131  1.61  0.0211
##  4 "Chess Strategy"                          k        1240 0.0166  0.916 0.0152
##  5 "Chess Strategy"                          r        1155 0.0155  0.916 0.0142
##  6 "Chess Strategy"                          pawn      523 0.00700 1.61  0.0113
##  7 "Chess Strategy"                          b        1528 0.0205  0.511 0.0105
##  8 "Chess Strategy"                          white     805 0.0108  0.916 0.00988
##  9 "The Riddle of the Rhine: Chemical Strat… gas       773 0.0101  0.916 0.00924
## 10 "National Strategy for Combating Terrori… terror…   106 0.00998 0.916 0.00915
## # … with 20,862 more rows
```

---

## **Calculating tf-idf**

```r
book_tfidf %>%
  group_by(title) %>%
  top_n(10) %>%
  ungroup() %>%
* ggplot(aes(fct_reorder(word, tf_idf), tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~title, scales = "free")
```

---

<!-- -->

---

## **N-grams... and beyond!**

```r
tidy_ngram <- full_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

tidy_ngram
```

```
## # A tibble: 10,471 x 2
##    gutenberg_id bigram
##           <int> <chr>
##  1         6522 original html
##  2         6522 html version
##  3         6522 version created
##  4         6522 created at
##  5         6522 at eldritchpress.org
##  6         6522 eldritchpress.org by
##  7         6522 by eric
##  8         6522 eric eldred
##  9         6522 eldred this
## 10         6522 this ebook
## # … with 10,461 more rows
```

---

## **N-grams... and beyond!**

```r
tidy_ngram %>%
  count(bigram, sort = TRUE)
```

```
## # A tibble: 7,465 x 2
##    bigram       n
##    <chr>    <int>
##  1 of the     103
##  2 in the     102
##  3 and the     41
##  4 to the      34
##  5 my heart    29
##  6 in my       27
##  7 my life     26
##  8 with the    25
##  9 of my       24
## 10 on the      23
## # … with 7,455 more rows
```

---

## Your turn 10

.large[Can we remove stop words?]

- .large[Yes!]
- .large[No]

---

## **N-grams... and beyond!**

```r
bigram_counts <- tidy_ngram %>%
* separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```

---

## **N-grams... and beyond!**

```r
bigram_counts
```

```
## # A tibble: 1,063 x 3
##    word1    word2       n
##    <chr>    <chr>   <int>
##  1 lord     buddha      7
##  2 morning  light       5
##  3 thy      trumpet     5
##  4 art      thou        3
##  5 earthen  lamp        3
##  6 heart    timid       3
##  7 lose     heart       3
##  8 city     wall        2
##  9 clan     art         2
## 10 daylight faded       2
## # … with 1,053 more rows
```

---

background-image: url(figs/p_and_p_cover.png)
background-size: cover
class: inverse

## What can you do with n-grams?

- .large[tf-idf of n-grams]

--

- .large[network analysis (sketched on the next slide)]

--

- .large[negation]
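---

## **Sketch: a bigram network**

.large[A minimal sketch of the network-analysis idea (not part of the original exercises; it assumes the igraph and ggraph packages are installed, and the frequency cutoff is arbitrary):]

```r
library(igraph)
library(ggraph)

# treat each bigram as an edge: word1 -> word2, weighted by n
bigram_graph <- bigram_counts %>%
  filter(n >= 2) %>%   # keep only the more frequent bigrams (assumed cutoff)
  graph_from_data_frame()

set.seed(1234)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```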
---

background-image: url(figs/austen-1.png)
background-size: 750px

---

background-image: url(figs/slider.gif)
background-position: 50% 70%

## **What can you do with n-grams?**

### [She Giggles, He Gallops](https://pudding.cool/2017/08/screen-direction/)

---

background-image: url(figs/change_overall-1.svg)
background-size: contain
background-position: center

---

class: left, middle

# Thanks!

Slides created with [**remark.js**](http://remarkjs.com/) and the R package [**xaringan**](https://github.com/yihui/xaringan)