TidyText was created for text mining, particularly within the tidyverse, to make analyzing and visualizing text straightforward. Easy manipulation of text is essential to text mining and natural language processing. This package lets the R user manipulate text the same way they would manipulate any traditional kind of data. It covers tidying text with the unnest functions, performing sentiment analysis, using term frequency-inverse document frequency (tf-idf) statistics to highlight important terms within documents, and analyzing word networks based on varying n-grams.
TidyText is currently on version 0.2.6, and there were 15 versions before this (0.1.0 through 0.2.6). Each version has become more complex and added more functionality to the package.
Tidytext allows you to apply data wrangling and data visualization methods to text the same way you would apply them to other data. It does this by treating text as data frames of individual words, which makes it easy to manipulate, summarize, and visualize the characteristics of text, and it integrates natural language processing (NLP) into the tidy workflow. The library also includes sentiment analysis and text mining techniques, which are covered later on.
While tidytext has a wide range of functionality, it is dependent on other packages for some of its analysis. The packages it depends on include:
tidyr
As its name suggests, tidytext relies on tidyr. The ultimate goal of tidytext is to convert text into usable 'tidy' data that can be manipulated using the traditional tidyverse functions. It arranges the text into tibbles that can be cleaned and manipulated with tidyverse tools.
dplyr
This is another package used to clean and manipulate the tibbles of text created by tidytext. Dplyr has a wide array of helpful functions, including join functions such as anti_join() and summary functions such as count().
wordcloud
This is a package used for visualizing the text data created with tidytext. The wordcloud() function lets the R user create word clouds, which arrange words in a cloud-like pattern with the frequency / word counts represented by attributes such as size and color.
ggplot2
This package is used for elegant data visualization, layering different components to build up a plot. Since tidytext reshapes text into tidy data, ggplot2 can be applied directly to the resulting text data to make appealing visualizations.
tm
This package also offers functions for text mining, specifically importing data, handling corpora, preprocessing the data, and creating document-term matrices. It is used in conjunction with tidytext for text mining and analysis.
Stanford defines a token as "an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing."
We will be using Trump’s remarks about leaving the Walter Reed Medical Center as the text for these functions.
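Below is a minimal sketch of how a trumptibble like the one used in the following examples could be built; the transcript is abbreviated to its opening sentence, and the object and column names are assumptions based on the code that follows.

library(tibble)

# Put the speech transcript into a one-row tibble with a `text` column
# (only the first sentence is shown here for brevity).
trumptibble <- tibble(
  text = "I just left Walter Reed Medical Center, and it's really something very special."
)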
unnest_tokens() splits a column into tokens using the tokenizers package, producing one token per row of the table. The function supports non-standard evaluation through the tidyeval framework. It also has sibling wrapper functions that work with specific formats such as regular expressions and tweets.
trumptibble %>% unnest_tokens(word, text)
# A tibble: 257 x 1
word
<chr>
1 i
2 just
3 left
4 walter
5 reed
6 medical
7 center
8 and
9 it’s
10 really
# … with 247 more rows
unnest_regex() is a special function that wraps the original unnest_tokens() as
unnest_tokens(token = "regex")
trumptibble %>% unnest_regex(word, text, pattern = "We")
# A tibble: 8 x 1
word
<chr>
1 "i just left walter reed medical center, and it’s really something very speci…
2 " have the best medical equipment. "
3 " have the best medicines all developed recently, and you’re going to beat it…
4 " have the greatest country in the world. "
5 "’re going back. "
6 "’re going back to work. "
7 "’re going to be out front. as your leader, i had to do that. i knew there’s …
8 " have the best medicines in the world, and they’re all happened very shortly…
unnest_ngrams() splits the text into n-grams of length n; n-grams are often used to predict the next item in a sequence. It is a special function that wraps the original unnest_tokens() as
unnest_tokens(token = "ngrams")
3 words per line
trumptibble %>% unnest_ngrams(word, text, n = 3)
# A tibble: 255 x 1
word
<chr>
1 i just left
2 just left walter
3 left walter reed
4 walter reed medical
5 reed medical center
6 medical center and
7 center and it’s
8 and it’s really
9 it’s really something
10 really something very
# … with 245 more rows
5 words per line
trumptibble %>% unnest_ngrams(word, text, n = 5)
# A tibble: 253 x 1
word
<chr>
1 i just left walter reed
2 just left walter reed medical
3 left walter reed medical center
4 walter reed medical center and
5 reed medical center and it’s
6 medical center and it’s really
7 center and it’s really something
8 and it’s really something very
9 it’s really something very special
10 really something very special the
# … with 243 more rows
unnest_sentences() is a special function that wraps the original unnest_tokens() as
unnest_tokens(token = "sentences")
trumptibble %>% unnest_sentences(word, text)
# A tibble: 37 x 1
word
<chr>
1 i just left walter reed medical center, and it’s really something very speci…
2 the doctors, the nurses, the first responders, and i learned so much about c…
3 one thing that’s for certain, don’t let it dominate you.
4 don’t be afraid of it.
5 you’re going to beat it.
6 we have the best medical equipment.
7 we have the best medicines all developed recently, and you’re going to beat …
8 i went … i didn’t feel so good.
9 and two days ago, i could have left two days ago.
10 two days ago, i felt great.
# … with 27 more rows
“Thank you very much” was the only sentence repeated more than once.
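A quick sketch of the check behind that observation, counting repeated sentences (this assumes the trumptibble and packages loaded earlier):

library(dplyr)

trumptibble %>%
  unnest_sentences(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(n > 1)   # keep only sentences that appear more than once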
unnest_characters() is a special function that wraps the original unnest_tokens() as
unnest_tokens(token = "characters")
trumptibble %>% unnest_characters(word, text)
# A tibble: 1,023 x 1
word
<chr>
1 i
2 j
3 u
4 s
5 t
6 l
7 e
8 f
9 t
10 w
# … with 1,013 more rows
e, t, and o are the most used characters in this specific speech.
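Those character frequencies can be checked with a count on the unnested characters, for example:

trumptibble %>%
  unnest_characters(word, text) %>%
  count(word, sort = TRUE)   # most common characters first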
unnest_tweets() is a special function that wraps the original unnest_tokens() as
unnest_tokens(token = "tweets")
Hope Hicks, who has been working so hard without even taking a small break, has just tested positive for Covid 19. Terrible! The First Lady and I are waiting for our test results. In the meantime, we will begin our quarantine process!
— Donald J. Trump (@realDonaldTrump) October 2, 2020
Tonight, @FLOTUS and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!
— Donald J. Trump (@realDonaldTrump) October 2, 2020
Doctors, Nurses and ALL at the GREAT Walter Reed Medical Center, and others from likewise incredible institutions who have joined them, are AMAZING!!!Tremendous progress has been made over the last 6 months in fighting this PLAGUE. With their help, I am feeling well!
— Donald J. Trump (@realDonaldTrump) October 3, 2020
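The tweets above could be tokenized with the tweet-aware tokenizer, which keeps @mentions and #hashtags intact instead of splitting them apart. A minimal sketch, assuming the second tweet is copied into a tibble (the object name tweettibble is an assumption):

library(tibble)
library(tidytext)

tweettibble <- tibble(
  text = "Tonight, @FLOTUS and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!"
)

# unnest_tweets() wraps unnest_tokens(token = "tweets")
tweettibble %>% unnest_tweets(word, text)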
stop_words
Stop words are words that are extremely common and don't add much meaning to a document. Some examples of stop words are: the, a, an, it, and also. The stop_words data frame combines three lexicons: SMART, snowball, and onix; the snowball and SMART sets are pulled from the tm package.
stop_words %>% count(lexicon, sort=TRUE)
# A tibble: 3 x 2
lexicon n
<chr> <int>
1 SMART 571
2 onix 404
3 snowball 174
stop_words
# A tibble: 1,149 x 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# … with 1,139 more rows
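Stop words are typically removed with anti_join(), which keeps only the tokens that do not appear in stop_words. A minimal sketch using the speech from earlier:

library(dplyr)
library(tidytext)

trumptibble %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # most frequent remaining words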
We can take a cleaned-up corpus and analyze its sentiment from the patterns and frequencies of the words. Sentiment analysis relies on predefined lexicons that categorize words by sentiment. The three most widely used sentiment lexicon data sets are:
sentiments
From Bing Liu, Finn Årup Nielsen, and Saif Mohammad and Peter Turney respectively:
‘BING’: labels words as either positive or negative
‘AFINN’: gives words a rating from -5 to +5
‘NRC’: categorizes words into human emotions like joy, fear, sadness.
get_sentiments
This function loads these lexicons as data frames, which we can then join onto our corpus with functions like inner_join().
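A sketch of how a sentiment table like the one below could be produced, by inner-joining the tokenized speech with a lexicon loaded through get_sentiments() (the choice of the Bing lexicon here is an assumption):

trumptibble %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")   # keep only words found in the lexicon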
# A tibble: 8 x 2
word sentiment
<chr> <chr>
1 dominate positive
2 afraid negative
3 dominate positive
4 danger negative
5 led positive
6 risk negative
7 danger negative
8 dominate positive
[Embedded tweets from @realDonaldTrump, dated October 5, October 4, October 3, and May 29, 2020.]
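The tables below compare several speeches using tf-idf. tidytext's bind_tf_idf() adds term frequency (tf), inverse document frequency (idf), and their product (tf_idf) to a word-count table; words such as "the" and "to" appear in every speech and therefore get an idf (and tf_idf) of zero, while words specific to one speech score higher. A sketch of how such a table could be built, assuming the speeches are combined in a data frame speeches with columns Speech and text (that object name is an assumption):

library(dplyr)
library(tidytext)

speech_tf_idf <- speeches %>%
  unnest_tokens(word, text) %>%
  count(Speech, word) %>%                 # word counts per speech
  group_by(Speech) %>%
  mutate(total = sum(n)) %>%              # total words in each speech
  ungroup() %>%
  bind_tf_idf(word, Speech, n)            # add tf, idf, and tf_idf columns

speech_tf_idf %>% arrange(desc(n)) %>% head()   # most frequent words, with an idf of 0
speech_tf_idf %>% arrange(word) %>% tail()      # rarer, more distinctive words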
# A tibble: 6 x 7
Speech word n total tf idf tf_idf
<dbl> <chr> <int> <int> <dbl> <dbl> <dbl>
1 3 the 39 709 0.0550 0 0
2 3 to 31 709 0.0437 0 0
3 3 i 28 709 0.0395 0 0
4 3 and 24 709 0.0339 0 0
5 3 that 17 709 0.0240 0 0
6 1 i 16 257 0.0623 0 0
# A tibble: 6 x 7
Speech word n total tf idf tf_idf
<dbl> <chr> <int> <int> <dbl> <dbl> <dbl>
1 3 won’t 1 709 0.00141 1.10 0.00155
2 3 working 1 709 0.00141 1.10 0.00155
3 3 would 1 709 0.00141 0.405 0.000572
4 3 you’ve 1 709 0.00141 1.10 0.00155
5 3 your 1 709 0.00141 0.405 0.000572
6 3 yourself 1 709 0.00141 1.10 0.00155
parts_of_speech
Parts of speech for English words from the Moby Project by Grady Ward.
parts_of_speech %>% count(pos, sort = TRUE)
# A tibble: 14 x 2
pos n
<chr> <int>
1 Noun 104542
2 Adjective 47719
3 Verb (transitive) 15723
4 Adverb 13234
5 Verb (usu participle) 11402
6 Plural 7764
7 Verb (intransitive) 4626
8 <NA> 2274
9 Interjection 395
10 Preposition 159
11 Noun Phrase 115
12 Pronoun 113
13 Definite Article 103
14 Conjunction 90
parts_of_speech
# A tibble: 208,259 x 2
word pos
<chr> <chr>
1 3-d Adjective
2 3-d Noun
3 4-f Noun
4 4-h'er Noun
5 4-h Adjective
6 a' Adjective
7 a-1 Noun
8 a-axis Noun
9 a-bomb Noun
10 a-frame Noun
# … with 208,249 more rows
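The parts_of_speech table can be joined onto tokenized text to tag each word; a word with several possible tags gets one row per tag. A minimal sketch using the speech from earlier:

trumptibble %>%
  unnest_tokens(word, text) %>%
  inner_join(parts_of_speech, by = "word")   # one row per word/part-of-speech pair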
nma_words
English negators, modals, and adverbs, as a data frame. A few of these entries are two-word phrases instead of single words.
nma_words %>% count(modifier, sort=TRUE)
# A tibble: 3 x 2
modifier n
<chr> <int>
1 adverb 22
2 negator 15
3 modal 7
nma_words
# A tibble: 44 x 2
word modifier
<chr> <chr>
1 cannot negator
2 could not negator
3 did not negator
4 does not negator
5 had no negator
6 have no negator
7 may not negator
8 never negator
9 no negator
10 not negator
# … with 34 more rows
quanteda
A fast, flexible, and comprehensive framework for quantitative text analysis in R. It provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, visually representing text and text analyses, and more. It is preferred over the tokenizers package because it uses multi-threaded processing.
text2vec
Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities. The package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents that are larger than available RAM. All core functions are parallelized to benefit from multicore machines.
In conclusion, tidytext offers simplified but fast access to a few key text analysis tools. These tools let the user split text, gather key statistics, and analyze the results against a sentiment lexicon. The functions in this package are highly compatible with the rest of the tidyverse, so it can be easily integrated into other projects.
While working with this package, we've identified some advantages and disadvantages to doing text analysis with tidytext. First and foremost, the package depends on quite a few other packages: to do any worthwhile analysis, an R user must also have installed the tidyverse, dplyr, other text analysis packages such as tm, and plotting packages such as plotly and ggplot2. It is not a standalone package; it is used in conjunction with many others. Also, because of the high-dimensional nature of text analysis, tidytext is limited in the depth of analysis it produces. For example, the accompanying book touches on topics such as Latent Dirichlet Allocation but gives little guidance on how to build such generative models, even though topic modeling is a huge part of text analysis. However, this is a fairly robust package for the basics of text analysis, and an R user looking for a quick and easy way to explore text will be satisfied by it. Its accompanying textbook is also an excellent resource that goes into detail about the theory behind the package and provides code examples.
https://humansofdata.atlan.com/2018/07/introduction-tidytext-mining/
https://towardsdatascience.com/r-packages-for-text-analysis-ad8d86684adb
https://www.rdocumentation.org/packages/tidytext/versions/0.2.0
https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
Trump Speeches:
Speech 4: https://www.whitehouse.gov/briefings-statements/remarks-president-trump-actions-china/