Package Overview

TidyText

tidytext was created for text mining within the tidyverse, making it possible to analyze and visualize text with the same tools used for other data. Easy manipulation of text is essential to text mining and natural language processing, and this package lets the R user work with text the way they would work with any traditional data. The package supports tidying text with the unnest functions, performing sentiment analysis, computing term frequency–inverse document frequency (tf-idf) statistics to highlight important terms within documents, and analyzing word networks based on varying n-grams.

Version History

tidytext is currently on version 0.2.6, with 15 releases before it (0.1.0 through 0.2.5). Each release has grown more capable and added more functionality to the package.

Usage

tidytext allows you to apply data wrangling and data visualization methods to text the same way you would apply them to other data. It does this by treating text as data frames of individual words (one token per row), which makes it easy to manipulate, summarize, and visualize the characteristics of text and integrates natural language processing (NLP) into the tidyverse workflow. The library also includes sentiment analysis and text mining tools, which are covered later on.

Dependency to Other Packages

While tidytext has a wide range of functionality, it is dependent on other packages for some of its analysis. The packages it depends on include:

tidyr

As its name suggests, tidytext relies on tidyr. The ultimate goal of tidytext is to convert text into usable ‘tidy’ data that can be manipulated using the traditional tidyverse functions. It arranges the text into tibbles that can then be cleaned and manipulated with the rest of the tidyverse.

dplyr

This is another package used to clean and manipulate the tibbles of text created by tidytext. dplyr has a wide array of helpful functions, including filtering joins such as anti_join() and summary functions such as count().

wordcloud

This is a package used for visualizing the text data created with tidytext. The wordcloud() function lets the R user create word clouds, which arrange words in a cloud-like pattern with frequency (word count) represented by attributes such as size and color.
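
As a minimal sketch (assuming word_counts is a data frame of words and their frequencies, like the ones built with unnest_tokens() and count() later in this report):

library(wordcloud)

# draw up to 50 of the most frequent words, sized by their counts
wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 50)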

ggplot2

This package is used for elegant data visualization. It layers the different components of a visualization to build up the finished plot. Since tidytext makes text meet the criteria for tidy data, ggplot2 can easily be used on the resulting text data to produce appealing visualizations.
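
For example, a hedged sketch of a bar chart of the ten most frequent words (again assuming a word_counts data frame with columns word and n):

library(dplyr)
library(ggplot2)

word_counts %>%
  slice_max(n, n = 10) %>%                    # keep the ten most frequent words
  ggplot(aes(x = reorder(word, n), y = n)) +  # order the bars by count
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "word count")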

tm

This is a package that also offers functions for text mining, specifically importing data, handling corpora, processing the data, and creating document-term matrices. It is used in conjunction with tidytext for text mining and analysis.
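
tidytext can also convert between tidy text and the document-term matrices used by tm. A minimal sketch, assuming a tidy table of word counts named speech_words with columns Speech, word, and n:

library(dplyr)
library(tidytext)

# cast the tidy counts into a tm DocumentTermMatrix (requires tm to be installed)
speech_dtm <- speech_words %>%
  cast_dtm(Speech, word, n)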

Examples of Usage

Unnest Tokens

As defined by Stanford, “A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.”

We will use Trump’s remarks on leaving Walter Reed Medical Center as the text for these examples.
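
The examples below assume a tibble named trumptibble with a single column called text that holds the speech. A minimal sketch of how such an object could be built (the file name is a placeholder, not the actual source we used):

library(dplyr)
library(readr)

# read the transcript (one element per line) and store it in a "text" column
speech_lines <- read_lines("walter_reed_remarks.txt")
trumptibble  <- tibble(text = speech_lines)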

unnest_tokens

Splits a column into tokens using the tokenizers package, producing a table with one token per row. This function supports non-standard evaluation through the tidyeval framework. It also has sibling wrapper functions that work with specific formats such as regular expressions and tweets.

trumptibble %>% unnest_tokens(word, text)
# A tibble: 257 x 1
   word   
   <chr>  
 1 i      
 2 just   
 3 left   
 4 walter 
 5 reed   
 6 medical
 7 center 
 8 and    
 9 it’s   
10 really 
# … with 247 more rows

unnest_regex

Special function that wraps the original unnest_tokens function as

unnest_tokens(token = "regex")

trumptibble %>% unnest_regex(word, text, pattern = "We")
# A tibble: 8 x 1
  word                                                                          
  <chr>                                                                         
1 "i just left walter reed medical center, and it’s really something very speci…
2 " have the best medical equipment. "                                          
3 " have the best medicines all developed recently, and you’re going to beat it…
4 " have the greatest country in the world. "                                   
5 "’re going back. "                                                            
6 "’re going back to work. "                                                    
7 "’re going to be out front. as your leader, i had to do that. i knew there’s …
8 " have the best medicines in the world, and they’re all happened very shortly…

unnest_ngrams

Splits the text into n-grams, i.e., consecutive sequences of n words. n-grams are often used to predict the next item in a sequence. Special function that wraps the original unnest_tokens function as

unnest_tokens(token = "ngrams")

3 words per line

trumptibble %>% unnest_ngrams(word, text, n = 3)
# A tibble: 255 x 1
   word                 
   <chr>                
 1 i just left          
 2 just left walter     
 3 left walter reed     
 4 walter reed medical  
 5 reed medical center  
 6 medical center and   
 7 center and it’s      
 8 and it’s really      
 9 it’s really something
10 really something very
# … with 245 more rows

5 words per line

trumptibble %>% unnest_ngrams(word, text, n = 5)
# A tibble: 253 x 1
   word                              
   <chr>                             
 1 i just left walter reed           
 2 just left walter reed medical     
 3 left walter reed medical center   
 4 walter reed medical center and    
 5 reed medical center and it’s      
 6 medical center and it’s really    
 7 center and it’s really something  
 8 and it’s really something very    
 9 it’s really something very special
10 really something very special the 
# … with 243 more rows

unnest_sentences

Special function that wraps the original unnest_tokens function as

unnest_tokens(token = "sentences")

trumptibble %>% unnest_sentences(word, text)
# A tibble: 37 x 1
   word                                                                         
   <chr>                                                                        
 1 i just left walter reed medical center, and it’s really something very speci…
 2 the doctors, the nurses, the first responders, and i learned so much about c…
 3 one thing that’s for certain, don’t let it dominate you.                     
 4 don’t be afraid of it.                                                       
 5 you’re going to beat it.                                                     
 6 we have the best medical equipment.                                          
 7 we have the best medicines all developed recently, and you’re going to beat …
 8 i went … i didn’t feel so good.                                              
 9 and two days ago, i could have left two days ago.                            
10 two days ago, i felt great.                                                  
# … with 27 more rows

“Thank you very much” was the only sentence that appeared more than once.

unnest_characters

Special function that wraps the original unnest_tokens function as

unnest_tokens(token = "characters")

trumptibble %>% unnest_characters(word, text)
# A tibble: 1,023 x 1
   word 
   <chr>
 1 i    
 2 j    
 3 u    
 4 s    
 5 t    
 6 l    
 7 e    
 8 f    
 9 t    
10 w    
# … with 1,013 more rows

e, t, and o are the most used characters in this specific speech.

unnest_tweets

Special function that wraps the original unnest_tokens function as

unnest_tokens(token = "tweets")
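
A hypothetical sketch (the tweets tibble below is invented for illustration); unlike the plain word tokenizer, the tweet tokenizer keeps @usernames and #hashtags intact:

library(dplyr)
library(tidytext)

tweets <- tibble(text = "loving #rstats and the tidytext package from @juliasilge")
tweets %>% unnest_tweets(word, text)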

Stop Words

stop_words

Stop words are words that are extremely common and don’t add much meaning to a document. Some examples of stop words are “the,” “a,” “an,” “it,” and “also.” The stop_words dataset combines three lexicons: onix, SMART, and snowball; the snowball and SMART sets are pulled from the tm package.

stop_words %>% count(lexicon, sort=TRUE)
# A tibble: 3 x 2
  lexicon      n
  <chr>    <int>
1 SMART      571
2 onix       404
3 snowball   174
stop_words
# A tibble: 1,149 x 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# … with 1,139 more rows
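
A common pattern is to remove stop words from tokenized text with dplyr’s anti_join(). A minimal sketch using the trumptibble example from above:

library(dplyr)
library(tidytext)

trumptibble %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%  # drop any word that appears in stop_words
  count(word, sort = TRUE)                # count the remaining words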

Sentiment Analysis

We can take a cleaned-up corpus and analyze its sentiment from the patterns and frequencies of its words. Sentiment analysis relies on pre-defined lexicons that categorize words according to sentiment. The three most widely used sentiment lexicon datasets are:

sentiments

From Bing Liu, Finn Årup Nielsen, and Saif Mohammad and Peter Turney respectively:

‘BING’: labels words as either positive or negative

‘AFINN’: gives words a rating from -5 to +5

‘NRC’: categorizes words into human emotions such as joy, fear, and sadness.

get_sentiments

This function allows us to load these lexicons as data frames, which we can join onto our tidied corpus with functions like inner_join().
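
A minimal sketch of how word-level sentiments like the output below could be obtained, assuming the trumptibble object from the earlier examples and the Bing lexicon (an illustration, not the exact code used):

library(dplyr)
library(tidytext)

trumptibble %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")  # keep only words found in the lexicon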

# A tibble: 8 x 2
  word     sentiment
  <chr>    <chr>    
1 dominate positive 
2 afraid   negative 
3 dominate positive 
4 danger   negative 
5 led      positive 
6 risk     negative 
7 danger   negative 
8 dominate positive 

Term Frequency - Inverse Document Frequency
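
tf-idf weighs how often a word appears in one document (tf) against how many documents contain it (idf); a word that appears in every document gets an idf, and therefore a tf-idf, of zero. In the first table below, the most frequent words in one speech are common words like “the” and “to” with idf = 0; the second table shows less common words that receive a nonzero tf-idf. A hedged sketch of how such a table could be built with bind_tf_idf(), assuming a speeches tibble with a Speech id column and a text column covering several speeches:

library(dplyr)
library(tidytext)

speech_words <- speeches %>%
  unnest_tokens(word, text) %>%
  count(Speech, word, sort = TRUE) %>%  # n = count of each word within each speech
  group_by(Speech) %>%
  mutate(total = sum(n)) %>%            # total number of words in each speech
  ungroup() %>%
  bind_tf_idf(word, Speech, n)          # adds the tf, idf, and tf_idf columns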

# A tibble: 6 x 7
  Speech word      n total     tf   idf tf_idf
   <dbl> <chr> <int> <int>  <dbl> <dbl>  <dbl>
1      3 the      39   709 0.0550     0      0
2      3 to       31   709 0.0437     0      0
3      3 i        28   709 0.0395     0      0
4      3 and      24   709 0.0339     0      0
5      3 that     17   709 0.0240     0      0
6      1 i        16   257 0.0623     0      0
# A tibble: 6 x 7
  Speech word         n total      tf   idf   tf_idf
   <dbl> <chr>    <int> <int>   <dbl> <dbl>    <dbl>
1      3 won’t        1   709 0.00141 1.10  0.00155 
2      3 working      1   709 0.00141 1.10  0.00155 
3      3 would        1   709 0.00141 0.405 0.000572
4      3 you’ve       1   709 0.00141 1.10  0.00155 
5      3 your         1   709 0.00141 0.405 0.000572
6      3 yourself     1   709 0.00141 1.10  0.00155 

Other

parts_of_speech

Parts of speech for English words from the Moby Project by Grady Ward.

parts_of_speech %>% count(pos, sort = TRUE)
# A tibble: 14 x 2
   pos                        n
   <chr>                  <int>
 1 Noun                  104542
 2 Adjective              47719
 3 Verb (transitive)      15723
 4 Adverb                 13234
 5 Verb (usu participle)  11402
 6 Plural                  7764
 7 Verb (intransitive)     4626
 8 <NA>                    2274
 9 Interjection             395
10 Preposition              159
11 Noun Phrase              115
12 Pronoun                  113
13 Definite Article         103
14 Conjunction               90
parts_of_speech
# A tibble: 208,259 x 2
   word    pos      
   <chr>   <chr>    
 1 3-d     Adjective
 2 3-d     Noun     
 3 4-f     Noun     
 4 4-h'er  Noun     
 5 4-h     Adjective
 6 a'      Adjective
 7 a-1     Noun     
 8 a-axis  Noun     
 9 a-bomb  Noun     
10 a-frame Noun     
# … with 208,249 more rows
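
A minimal sketch of tagging the speech tokens with their parts of speech, reusing the trumptibble example (many words map to several parts of speech, so the join can add rows):

library(dplyr)
library(tidytext)

trumptibble %>%
  unnest_tokens(word, text) %>%
  left_join(parts_of_speech, by = "word")  # attach the pos column for each matching word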

nma_words

English negators, modals, and adverbs, as a data frame. A few of these entries are two-word phrases instead of single words.

nma_words %>% count(modifier, sort=TRUE)
# A tibble: 3 x 2
  modifier     n
  <chr>    <int>
1 adverb      22
2 negator     15
3 modal        7
nma_words
# A tibble: 44 x 2
   word      modifier
   <chr>     <chr>   
 1 cannot    negator 
 2 could not negator 
 3 did not   negator 
 4 does not  negator 
 5 had no    negator 
 6 have no   negator 
 7 may not   negator 
 8 never     negator 
 9 no        negator 
10 not       negator 
# … with 34 more rows

Similar Packages

quanteda

A fast, flexible, and comprehensive framework for quantitative text analysis in R. Provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, visually representing text and text analyses, and more. It is preferred over the tokenizers package because it uses multi-threaded processing.

text2vec

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities. This package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents that are larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Reflection

In conclusion, tidytext offers simplified but fast access to a few key text analysis tools. These tools enable the user to split text into tokens, gather key statistics, and analyze it against a sentiment lexicon. The functions in this package are highly compatible with the other tidyverse packages, so it can easily be integrated into other projects.

While working with this package, we identified some advantages and disadvantages of doing text analysis using tidytext. First and foremost, this package depends on quite a few other packages: to do any worthwhile analysis, an R user must also have installed tidyverse, dplyr, other text analysis packages such as tm, and plotting packages such as plotly and ggplot2. It is not a standalone package; it is used in conjunction with many others. Also, because of the high-dimensional nature of text analysis, the tidytext package is limited in the depth of analysis it produces on its own. For example, the book touches upon other theories such as Latent Dirichlet Allocation but gives no insight into how to build such generative models, even though this kind of modeling is a huge part of text analysis. However, this is a fairly robust package for the basics of text analysis, and an R user looking for a quick and easy way to explore text will be satisfied by it. Its accompanying textbook is also an excellent resource that goes into detail about the theory behind the package and provides example code.