#Load libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ------------------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## -- Conflicts --------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.3
library(tidyr)
library(dplyr)
library(widyr)
## Warning: package 'widyr' was built under R version 3.6.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.6.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(textdata)
## Warning: package 'textdata' was built under R version 3.6.3
library(ggplot2)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.6.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(igraph)
## Warning: package 'igraph' was built under R version 3.6.3
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(grid)
library(tm)
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(Matrix)
## Warning: package 'Matrix' was built under R version 3.6.3
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.6.3
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.6.3
## Package version: 2.0.1
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
## The following object is masked from 'package:igraph':
## 
##     as.igraph
## The following object is masked from 'package:utils':
## 
##     View

#Preface

##Outline

  1. Chapter 1: Outlines the tidy text format and the unnest_tokens() function. It also introduces the gutenbergr and janeaustenr packages
  2. Chapter 2: Shows how to perform sentiment analysis on tidy text datasets
  3. Chapter 3: Describes the tf-idf statistic; the higher a term's tf-idf, the rarer and more distinctive that term is within the collection
  4. Chapter 4: Introduces n-grams and how to analyze word networks in text using the widyr and ggraph packages

The next chapters cover how to convert back and forth between tidy and nontidy formats

  1. Chapter 5: Introduces methods for tidying document-term matrices and Corpus objects from the tm and quanteda packages

  2. Chapter 6: Explores the concept of topic modeling, and uses the tidy() method to interpret and visualize the output of the topicmodels package

The remaining chapters apply these methods in practical case studies on real-life datasets

  1. Chapter 7: Demonstrates how to apply tidy text analysis to Twitter archives
  2. Chapter 8: Explores metadata from over 32,000 NASA datasets (available in JSON)
  3. Chapter 9: Analyzes a dataset of Usenet messages from a diverse set of newsgroups (covering topics like politics, hockey, technology, atheism, and more) to understand patterns across the groups
##Chapter 1

#The Tidy Text Format

Tidy data has a specific structure:

  * Each variable is a column
  * Each observation is a row
  * Each type of observational unit is a table

We thus define the tidy text format as a table with one token per row

##Contrasting Tidy Text with Other Data Structures

As we stated above, we define the tidy text format as a table with one token per row. Other common text data structures include:

String: Text can, of course, be stored as strings

Corpus: These types of objects typically contain raw strings annotated with additional metadata and details

Document-term matrix: This is a sparse matrix describing a collection of documents, with one row for each document and one column for each term. The value in the matrix is typically a word count or tf-idf

##The unnest_tokens Function

Emily Dickinson wrote some lovely text in her time

text<-c("Because I could not stop for Death-", "He kindly stopped for me-","The Carriage held but just ourselves-", "and Immortality")

library(dplyr)
text_df<- tibble(line=1:4, text=text)
text_df
## # A tibble: 4 x 2
##    line text                                 
##   <int> <chr>                                
## 1     1 Because I could not stop for Death-  
## 2     2 He kindly stopped for me-            
## 3     3 The Carriage held but just ourselves-
## 4     4 and Immortality

A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, does not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools
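Since this document was knit under R 3.6, where data.frame() still converts strings to factors by default (that default changed in R 4.0), a quick comparison illustrates the difference:

# text was defined above as a character vector
class(data.frame(text = text)$text)  # "factor" under R 3.x defaults
class(tibble(text = text)$text)      # "character"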

library(tidytext)

text_df %>% unnest_tokens(word,text)
## # A tibble: 20 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality

#Tidying the Works of Jane Austen

library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 3.6.3
library(dplyr)
library(stringr)

original_books<- austen_books()%>% group_by(book)%>% mutate(linenumber=row_number(),chapter=cumsum(str_detect(text,regex("^chapter [\\divxlc]",ignore_case = TRUE))))%>%ungroup()

original_books
## # A tibble: 73,422 x 4
##    text                  book                linenumber chapter
##    <chr>                 <fct>                    <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2 ""                    Sense & Sensibility          2       0
##  3 by Jane Austen        Sense & Sensibility          3       0
##  4 ""                    Sense & Sensibility          4       0
##  5 (1811)                Sense & Sensibility          5       0
##  6 ""                    Sense & Sensibility          6       0
##  7 ""                    Sense & Sensibility          7       0
##  8 ""                    Sense & Sensibility          8       0
##  9 ""                    Sense & Sensibility          9       0
## 10 CHAPTER 1             Sense & Sensibility         10       1
## # ... with 73,412 more rows
library(tidytext)
tidy_books<- original_books%>% unnest_tokens(word,text)

tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows

This function uses the tokenizers package to separate each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern
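For example (a quick sketch reusing the small text_df tibble from above), changing the token argument tokenizes the same text by sentences or by bigrams:

text_df %>% unnest_tokens(sentence, text, token = "sentences")
text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)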

We also want to remove the words in the stop words list, which are words that are not useful for an analysis

data(stop_words)

tidy_books<- tidy_books%>% anti_join(stop_words)
## Joining, by = "word"

We will use count() to find the most common words in all the books as a whole

library(ggplot2)

tidy_books%>% count(word,sort=TRUE)%>%filter(n>600)%>% mutate(word=reorder(word,n))%>% ggplot(aes(word,n))+geom_col()+xlab(NULL)+coord_flip()

#The gutenbergr Package

The gutenbergr package provides access to the public domain works from the Project Gutenberg collection

##Word Frequencies

A common task in text mining is to look at word frequencies, just like we have done above for Jane Austen's novels, and to compare different texts

library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 3.6.3
hgwells<- gutenberg_download(c(35,36,5230,159))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_hgwells<- hgwells%>% unnest_tokens(word,text)%>%anti_join(stop_words)
## Joining, by = "word"
tidy_hgwells%>% count(word,sort=TRUE)
## # A tibble: 11,769 x 2
##    word       n
##    <chr>  <int>
##  1 time     454
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    222
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # ... with 11,759 more rows
bronte<- gutenberg_download(c(1260,768,969,9182,767))

tidy_bronte<- bronte %>% unnest_tokens(word,text)%>% anti_join(stop_words)
## Joining, by = "word"

Now, let's calculate the frequency for each word in the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together.

library(tidyr)
frequency <- bind_rows(
  mutate(tidy_bronte, author = "Bronte_Sister"),
  mutate(tidy_hgwells, author = "H.G_Wells"),
  mutate(tidy_books, author = "Jane_Austen")
) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, Bronte_Sister:H.G_Wells)

library(scales)
## Warning: package 'scales' was built under R version 3.6.3
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(frequency, aes(
  x = proportion,
  y = Jane_Austen,
  color = abs(Jane_Austen - proportion)
)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(
    alpha = 0.1,
    size = 2.5,
    width = 0.3,
    height = 0.3
  ) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap( ~ author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)
## Warning: Removed 41357 rows containing missing values (geom_point).
## Warning: Removed 41359 rows containing missing values (geom_text).

###Correlation test

cor.test(data = frequency[frequency$author == "Bronte_Sister",],
~ proportion + Jane_Austen)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane_Austen
## t = 119.65, df = 10404, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7527869 0.7689642
## sample estimates:
##       cor 
## 0.7609938
cor.test(data = frequency[frequency$author == "H.G_Wells",],
~ proportion + Jane_Austen)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane_Austen
## t = 36.441, df = 6053, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4032800 0.4445987
## sample estimates:
##       cor 
## 0.4241601

##Summary

In this chapter, we explored what we mean by tidy data when it comes to text, and how tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tools ecosystem.

##Chapter 2

#Sentiment Analysis with Tidy Data

In the previous chapter we explored in depth what we mean by the tidy text format and showed how this format can be used to approach questions about word frequency


One way to analyze the sentiment of a text is to consider the text as a combination of its individual words

##The sentiments Dataset

As discussed above, there are a variety of methods and dictionaries for evaluating the opinion or emotion in text; the tidytext package gives access to several sentiment lexicons

library(tidytext)

sentiments
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows

The three general-purpose lexicons are AFINN, bing, and nrc:

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
#Sentiment Analysis with Inner Join
library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books<- austen_books()%>%group_by(book)%>% mutate(linenumber=row_number(),chapter=cumsum(str_detect(text, regex("^chapter [\\divxlc]",ignore_case=TRUE))))%>% ungroup()%>% unnest_tokens(word,text)

#Join with sentiments datasets 

nrcjoy<- get_sentiments("nrc")%>% filter(sentiment=="joy")

tidy_books%>% filter(book=="Emma")%>% inner_join(nrcjoy)%>%count(word,sort=TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

We can visualize how sentiment changes across each novel by computing, for chunks of 80 lines of text, the total number of positive words minus the total number of negative words

library(tidyr)

janeaustensentiment<- tidy_books%>% inner_join(get_sentiments("bing"))%>% count(book, index=linenumber %/% 80, sentiment)%>% spread(sentiment,n,fill=0)%>% mutate(sentiment=positive-negative)
## Joining, by = "word"
library(ggplot2)
ggplot(janeaustensentiment,aes(index,sentiment,fill=book))+geom_col(show.legend = FALSE)+facet_wrap(~book,ncol=2,scales="free_x")

##Most Common Positive and Negative Words

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment

bing_word_counts<-tidy_books%>% inner_join(get_sentiments("bing"))%>% count(word,sentiment,sort=TRUE)%>% ungroup()
## Joining, by = "word"
bing_word_counts%>% group_by(sentiment)%>% top_n(10)%>% ungroup()%>%mutate(word=reorder(word,n))%>% ggplot(aes(word,n,fill=sentiment))+geom_col(show.legend = FALSE)+facet_wrap(~sentiment,scales="free_y")+labs(y="Contribution to sentiment",x=NULL)+coord_flip()
## Selecting by n

##Wordclouds

We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in tidy format is useful for other plots as well

For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud.

library(wordcloud)

tidy_books%>% anti_join(stop_words)%>% count(word)%>%with(wordcloud(word,n,max.words=100))
## Joining, by = "word"

#wordcloud() expects vectors of words and frequencies rather than a data frame, so we use with() to evaluate it inside the data frame's columns

In other functions, such as comparison.cloud(), you may need to turn the data frame into a matrix with reshape2's acast()

With comparison.cloud(), the word cloud is divided into two parts based on Bing sentiment

library(reshape2)

tidy_books%>% inner_join(get_sentiments("bing"))%>% count(word,sentiment,sort=TRUE)%>%acast(word~sentiment,value.var = "n",fill = 0)%>% comparison.cloud(colors=c("gray20","gray80"),max.words = 100)
## Joining, by = "word"

##Looking at Units Beyond Just Words

Instead of just using the default word tokenization, we can tokenize by other units, such as sentences or n-grams.

Here are some examples that use a regex pattern to split the text.

Extract and count the chapters in each book:

austen_chapter<- austen_books()%>% group_by(book)%>% unnest_tokens(chapter,text,token="regex",pattern="Chapter|CHAPTER [\\dIVXLC]")%>% ungroup()

austen_chapter%>% group_by(book)%>% summarise(chapters=n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

We can also use the ratio of negative words to total words to find the most negative chapter in each book

bingnegative<- get_sentiments("bing")%>% filter(sentiment=="negative")

wordcounts<- tidy_books%>%group_by(book,chapter)%>%summarise(words=n())

tidy_books%>%semi_join(bingnegative)%>%group_by(book,chapter)%>% summarise(negativewords=n())%>% left_join(wordcounts,by=c("book","chapter"))%>%mutate(ratio=negativewords/words)%>%filter(chapter!=0)%>%top_n(1)%>%ungroup()
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

In Chapter 43 of Sense and Sensibility, Marianne is seriously ill, near death; and in Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!). Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry's scandalous adultery; Chapter 15 of Emma is when horrifying Mr. Elton proposes; and in Chapter 21 of Northanger Abbey, Catherine is deep in her Gothic faux fantasy of murder. Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth, how sad she was, and what a terrible mistake she realized it to be

##Chapter 3

#Analyzing Word and Document Frequency: tf-idf

A central question in text mining and natural language processing is how to quantify what a document is about.

One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document

Many words occur frequently in a document but may not be important, such as "the", "of", and "is". We might take the approach of adding words like these to a list of stop words and removing them before analysis, but some of these words can be more relevant in some documents than in others, so removing them outright can reduce the quality of the analysis.

Another approach is to look at a term's inverse document frequency (idf), which down-weights words that appear in many documents and up-weights words that appear in only a few.

idf(term) = ln(n_documents / n_documents containing term)

tf-idf(term) = tf(term) x idf(term)

From this formula, we can see that the higher a term's idf, the fewer documents the term appears in
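As a minimal sketch (using a made-up toy tibble rather than the Austen data), the idf can be computed by hand:

toy <- tibble(document = c(1, 1, 2, 2, 3),
              word = c("the", "whale", "the", "ship", "the"))

toy %>%
  distinct(document, word) %>%   # does each word appear in each document?
  count(word) %>%                # n = number of documents containing the word
  mutate(idf = log(length(unique(toy$document)) / n))
# "the" appears in all 3 documents, so its idf is log(3/3) = 0; rarer words score higher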

##Term Frequency in Jane Austen's Novels

Let us start by looking at the published novels of Jane Austen and first examine term frequency, then tf-idf

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words<- austen_books()%>% unnest_tokens(word,text)%>% count(book, word,sort=TRUE)%>%ungroup()

total_words<- book_words%>% group_by(book)%>% summarise(total=sum(n))

book_words<- left_join(book_words,total_words)
## Joining, by = "book"

As expected, there are some very common words with high frequency, such as "the", "and", "to", and so forth

We will plot n/total, each word's count divided by the total number of words in that novel

library(ggplot2)

ggplot(book_words,aes(n/total,fill=book))+geom_histogram(show.legend = FALSE)+xlim(NA,0.0009)+facet_wrap(~book,ncol=2,scales="free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 896 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).

##Zipf’s Law

Zipf's law states that the frequency with which a word appears is inversely proportional to its rank

freq_by_rank<- book_words%>% group_by(book)%>% mutate(rank=row_number(),`term frequency`=n/total)

freq_by_rank%>%ggplot(aes(rank,`term frequency`, color=book))+geom_line(size=1.1,alpha=0.8,show.legend = FALSE)+scale_x_log10()+scale_y_log10()
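To see how closely these novels follow Zipf's law, we can estimate the exponent of the power law by fitting a line on the log-log scale over a middle section of the rank range (a rough sketch; the rank cutoffs are arbitrary):

rank_subset <- freq_by_rank %>% filter(rank < 500, rank > 10)

# the slope of log10(term frequency) vs. log10(rank); a value near -1 indicates Zipf-like behavior
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)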

##The bind_tf_idf Function

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are not used very much in a collection or corpus of documents.

book_words<- book_words %>% bind_tf_idf(word,book,n)

book_words%>% arrange(desc(tf_idf))%>%mutate(word=factor(word,levels=rev(unique(word))))%>%group_by(book)%>%top_n(15)%>%ungroup()%>%ggplot(aes(word,tf_idf,fill=book))+geom_col(show.legend = FALSE)+labs(x=NULL,y="tf-idf")+facet_wrap(~book,ncol=2,scales="free")+coord_flip()
## Selecting by tf_idf

#This plot shows that the top tf-idf words in each novel are distinctive to that novel (mostly proper nouns such as character names)

##A Corpus of Physics Texts

Let us work with another corpus of documents to see what terms are important in a different set of works.

library(gutenbergr)
physics<- gutenberg_download(c(37729,14725,13476,5001),meta_fields="author")
physics_words<- physics%>% unnest_tokens(word,text)%>% count(author,word,sort=TRUE)%>%ungroup()

plot_physics<- physics_words%>% bind_tf_idf(word,author,n)%>%arrange(desc(tf_idf))%>%mutate(word=factor(word,levels = rev(unique(word))))%>% mutate(author=factor(author, levels = c("Galilei, Galileo",
"Huygens, Christiaan",
"Tesla, Nikola",
"Einstein, Albert")))

plot_physics%>% group_by(author)%>%top_n(15,tf_idf)%>%ungroup()%>%mutate(word=reorder(word,tf_idf))%>%ggplot(aes(word,tf_idf,fill=author))+geom_col(show.legend = FALSE)+labs(x=NULL,y="tf-idf")+facet_wrap(~author,ncol=2,scales="free")+coord_flip()

##Chapter 4

#Relationships Between Words: N-grams and Correlations

So far, we've considered words as individual units, and considered their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or which words tend to co-occur within the same documents

##Tokenizing by N-gram

We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentences

We do this by adding the token=“ngrams” option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of 2 consecutive words, often called “bigrams”:

library(dplyr)
library(tidytext)
library(janeaustenr)

 austen_bigrams<-austen_books()%>% unnest_tokens(bigram,text,token="ngrams",n=2)

##Counting and Filtering N-grams

Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr's count():

austen_bigrams%>%count(bigram,sort=TRUE)
## # A tibble: 211,236 x 2
##    bigram       n
##    <chr>    <int>
##  1 of the    3017
##  2 to be     2787
##  3 in the    2368
##  4 it was    1781
##  5 i am      1545
##  6 she had   1472
##  7 of her    1445
##  8 to the    1387
##  9 she was   1377
## 10 had been  1299
## # ... with 211,226 more rows

A lot of the most common bigrams are pairs of common (uninteresting) words, such as "of the" and "to be": what we call "stop words"

It is useful to separate the bigram into two words with tidyr's separate(), and then filter out rows where either word appears in the stop_words list

library(tidyr)

bigrams_separated<- austen_bigrams%>%separate(bigram,c("word1","word2"),sep=" ")

bigrams_filtered<- bigrams_separated%>% filter(!word1 %in% stop_words$word)%>%filter(!word2 %in% stop_words$word)

bigram_count<- bigrams_filtered%>%count(word1,word2,sort=TRUE)

bigram_count
## # A tibble: 33,421 x 3
##    word1   word2         n
##    <chr>   <chr>     <int>
##  1 sir     thomas      287
##  2 miss    crawford    215
##  3 captain wentworth   170
##  4 miss    woodhouse   162
##  5 frank   churchill   132
##  6 lady    russell     118
##  7 lady    bertram     114
##  8 sir     walter      113
##  9 miss    fairfax     109
## 10 colonel brandon     108
## # ... with 33,411 more rows

We can see that names (whether first and last, or with a salutation) are the most common pairs in Jane Austen's books

Now we can use tidyr's unite() to recombine the two words into meaningful bigrams

bigrams_united<- bigrams_filtered%>% unite(bigram,word1,word2,sep=" ")
bigrams_united
## # A tibble: 44,784 x 2
##    book                bigram                  
##    <fct>               <chr>                   
##  1 Sense & Sensibility jane austen             
##  2 Sense & Sensibility austen 1811             
##  3 Sense & Sensibility 1811 chapter            
##  4 Sense & Sensibility chapter 1               
##  5 Sense & Sensibility norland park            
##  6 Sense & Sensibility surrounding acquaintance
##  7 Sense & Sensibility late owner              
##  8 Sense & Sensibility advanced age            
##  9 Sense & Sensibility constant companion      
## 10 Sense & Sensibility happened ten            
## # ... with 44,774 more rows

In other analyses, you may be interested in the most common trigrams, which are consecutive sequences of three words

austen_books()%>% unnest_tokens(trigram,text,token="ngrams",n=3)%>% separate(trigram,c("word1","word2","word3"),sep=" ")%>% filter(!word1 %in% stop_words$word,!word2 %in% stop_words$word,!word3 %in% stop_words$word)%>%count(word1,word2,word3,sort=TRUE)
## # A tibble: 8,757 x 4
##    word1     word2     word3         n
##    <chr>     <chr>     <chr>     <int>
##  1 dear      miss      woodhouse    23
##  2 miss      de        bourgh       18
##  3 lady      catherine de           14
##  4 catherine de        bourgh       13
##  5 poor      miss      taylor       11
##  6 sir       walter    elliot       11
##  7 ten       thousand  pounds       11
##  8 dear      sir       thomas       10
##  9 twenty    thousand  pounds        8
## 10 replied   miss      crawford      7
## # ... with 8,747 more rows

##Analyzing Bigrams

This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common “streets” mentioned in each book.

bigrams_filtered %>% filter(word2=="street")%>%count(book,word1,sort=TRUE)
## # A tibble: 34 x 3
##    book                word1           n
##    <fct>               <chr>       <int>
##  1 Sense & Sensibility berkeley       16
##  2 Sense & Sensibility harley         16
##  3 Northanger Abbey    pulteney       14
##  4 Northanger Abbey    milsom         11
##  5 Mansfield Park      wimpole        10
##  6 Pride & Prejudice   gracechurch     9
##  7 Sense & Sensibility conduit         6
##  8 Sense & Sensibility bond            5
##  9 Persuasion          milsom          5
## 10 Persuasion          rivers          4
## # ... with 24 more rows
bigram_tf_idf<- bigrams_united %>% count(book,bigram)%>% bind_tf_idf(bigram,book,n)%>% arrange(desc(tf_idf))

##Using Bigrams to Provide Context in Sentiment Analysis

Our sentiment analysis approach in Chapter 2 simply counted the appearance of positive or negative words, according to a reference lexicon.

One of the problems with this approach is that a word's context can matter nearly as much as its presence. For example, the words "happy" and "like" will be counted as positive, even in sentences like "I'm not happy" or "I don't like it"

bigrams_separated %>% filter(word1=="not")%>%count(word1,word2,sort=TRUE)
## # A tibble: 1,246 x 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   be      610
##  2 not   to      355
##  3 not   have    327
##  4 not   know    252
##  5 not   a       189
##  6 not   think   176
##  7 not   been    160
##  8 not   the     147
##  9 not   at      129
## 10 not   in      118
## # ... with 1,236 more rows

To quantify the impact of these negations, we can use the AFINN lexicon, which gives a numeric sentiment value for each word

AFINN<- get_sentiments("afinn")

not_words<- bigrams_separated%>% filter(word1=="not")%>% inner_join(AFINN,by=c(word2="word"))%>% count(word2,score=value,sort=TRUE)%>% ungroup()

not_words%>% mutate(contribution=n*score)%>%arrange(desc(abs(contribution)))%>% head(20)%>% mutate(word2=reorder(word2,contribution))%>%ggplot(aes(word2,n*score,fill=n*score>0))+geom_col(show.legend = FALSE)+xlab("Words preceded by \"not\"")+ylab("Sentiment score * number of occurences")+coord_flip()

The bigrams "not like" and "not help" were overwhelmingly the largest causes of misidentification.

Without accounting for negation, then, we may misjudge the overall sentiment of the text.

Now, let us look at other common negation words, such as "not", "no", "never", and "without"

negation_words<- c("not","no","never","without")
negated_words<- bigrams_separated %>% filter(word1 %in% negation_words)%>% inner_join(AFINN, by=c(word2="word"))%>% count(word1,word2,score=value,sort=TRUE)%>% ungroup()

negated_words%>% ggplot(aes(x=word2,y=n*score,fill=n*score))+geom_col()+facet_wrap(~word1)

##Visualizing a Network of Bigrams with ggraph

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or "graph". Here we'll be referring to a graph not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has 3 variables:

from

to

weight

library(igraph)

bigram_graph<- bigram_count%>% filter(n>20)%>%graph_from_data_frame()

graph_from_data_frame() builds a graph object from a data frame of edges, with columns for the node each edge comes from, the node it goes to, and edge attributes (here n); this forms the base object to which we can add further plotting features

library(ggraph)
## Warning: package 'ggraph' was built under R version 3.6.3
set.seed(2017)

ggraph(bigram_graph,layout="fr")+geom_edge_link()+geom_node_point()+geom_node_text(aes(label=name),vjust=1,hjust=1)

We conclude with a few polishing operations to make a better-looking graph:

  - We add the edge_alpha aesthetic to the link layer to make links transparent based on how common or rare the bigram is
  - We add directionality with an arrow, constructed using grid::arrow(), including an end_cap option that tells the arrow to end before touching the node
  - We tinker with the options to the node layer to make the nodes more attractive
  - We add a theme that's useful for plotting networks, theme_void()

set.seed(2016)

a<- grid::arrow(type="closed",length=unit(.15,"inches"))

ggraph(bigram_graph,layout="fr")+geom_edge_link(aes(edge_alpha=n),show.legend=FALSE,arrow=a,end_cap=circle(.07,"inches"))+geom_node_point(color="lightblue",size=5)+geom_node_text(aes(label=name),vjust=1,hjust=1)+theme_void()

##Counting and Correlating Pairs of Words with the widyr Package

Tokenizing by n-gram is a useful way to explore pairs of adjacent words. However, we may also be interested in words that tend to co-occur within particular documents or particular chapters, even if they don’t occur next to each other

The widyr package makes such operations easy: it lets us count and correlate pairs of words within a chosen unit of text (for example, sections of ten lines), by widening the data, performing the operation, and re-tidying the result

##Counting and Correlating Among Sections

austen_section_words<- austen_books()%>%filter(book=="Pride & Prejudice")%>% mutate(section=row_number()%/%10)%>% filter(section>0)%>%  unnest_tokens(word,text)%>%filter(!word %in% stop_words$word)
library(widyr)
word_pairs<- austen_section_words%>% pairwise_count(word,section,sort=TRUE)

"Choose the word darcy"
## [1] "Choose the word darcy"
word_pairs%>% filter(item1=="dancy")
## # A tibble: 0 x 3
## # ... with 3 variables: item1 <chr>, item2 <chr>, n <dbl>

word_cors<- austen_section_words%>% group_by(word)%>%filter(n()>=20)%>%pairwise_cor(word,section,sort=TRUE)
word_cors
## # A tibble: 154,842 x 3
##    item1     item2     correlation
##    <chr>     <chr>           <dbl>
##  1 bourgh    de              0.951
##  2 de        bourgh          0.951
##  3 pounds    thousand        0.701
##  4 thousand  pounds          0.701
##  5 william   sir             0.664
##  6 sir       william         0.664
##  7 catherine lady            0.663
##  8 lady      catherine       0.663
##  9 forster   colonel         0.622
## 10 colonel   forster         0.622
## # ... with 154,832 more rows
word_cors%>% filter(item1=="pounds")
## # A tibble: 393 x 3
##    item1  item2     correlation
##    <chr>  <chr>           <dbl>
##  1 pounds thousand       0.701 
##  2 pounds ten            0.231 
##  3 pounds fortune        0.164 
##  4 pounds settled        0.149 
##  5 pounds wickham's      0.142 
##  6 pounds children       0.129 
##  7 pounds mother's       0.119 
##  8 pounds believed       0.0932
##  9 pounds estate         0.0890
## 10 pounds ready          0.0860
## # ... with 383 more rows
set.seed(2016)
a<- grid::arrow(type="closed",length=unit(.15,"inches"))

word_cors%>%filter(correlation>.15)%>% graph_from_data_frame()%>% ggraph(layout="fr")+geom_edge_link(aes(edge_alpha=correlation),show.legend=FALSE,arrow=a,end_cap=circle(.07,"inches"))+geom_node_point(color="lightblue",size=5)+geom_node_text(aes(label=name),repel=TRUE)+theme_void()

##Summary

This chapter showed how the tidy text approach is useful not only for analyzing individual words, but also for exploring the relationships and connections between words

##Chapter 5

#Converting to and from Nontidy Formats

In the previous chapters, we've been analyzing text arranged in the tidy text format: a table with one token per document per row, such as is constructed by the function unnest_tokens(). This lets us use the popular suite of tidy tools such as dplyr, tidyr, and ggplot2 to explore and visualize text data

##Tidying a Document-Term Matrix

Recall that in the tidy format, each row represents one token (one document-term combination).

One of the most common structures that text mining packages work with is the document-term matrix (DTM). This is a matrix where:

  - each row represents one document
  - each column represents one term
  - each value typically contains the number of appearances of that term in that document

Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns)

DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages. Thus, the tidytext package provides two verbs that convert between the two formats: tidy() turns a document-term matrix into a tidy data frame, and the cast_ verbs (cast_dtm(), cast_dfm(), cast_sparse()) turn a tidy one-term-per-row data frame into a matrix

##Tidying DocumentTermMatrix Objects

Perhaps the most widely used implementation of DTMs in R is the DocumentTermMatrix class in the tm package

library(tm)
data("AssociatedPress")

AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
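Even though the matrix is stored sparsely, it can be queried and indexed much like an ordinary matrix; a quick sketch:

dim(AssociatedPress)                 # number of documents and terms
inspect(AssociatedPress[1:2, 1:5])   # view a small corner of the matrix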
"To extract the term from DTM. Terms function is used for this case"
## [1] "To extract the term from DTM. Terms function is used for this case"
terms<- Terms(AssociatedPress)

"This shows in matrix format"
## [1] "This shows in matrix format"
head(terms)
## [1] "aaron"      "abandon"    "abandoned"  "abandoning" "abbott"    
## [6] "abboud"
library(dplyr)
library(tidytext)

ap_td<- tidy(AssociatedPress)
ap_td 
## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # ... with 302,021 more rows
"Notice that only the nonzero values are included in the tidied out‐
put: document 1 includes terms such as “adding” and “adult,” but
not aaron” or abandon.” This means the tidied version has no
rows where count is zero."
## [1] "Notice that only the nonzero values are included in the tidied out-\nput: document 1 includes terms such as “adding” and “adult,” but\nnot aaron” or abandon.” This means the tidied version has no\nrows where count is zero."
ap_sentiments<- ap_td%>%inner_join(get_sentiments("bing"),by=c(term="word"))

ap_sentiments
## # A tibble: 30,094 x 4
##    document term    count sentiment
##       <int> <chr>   <dbl> <chr>    
##  1        1 assault     1 negative 
##  2        1 complex     1 negative 
##  3        1 death       1 negative 
##  4        1 died        1 negative 
##  5        1 good        2 positive 
##  6        1 illness     1 negative 
##  7        1 killed      2 negative 
##  8        1 like        2 positive 
##  9        1 liked       1 positive 
## 10        1 miracle     1 positive 
## # ... with 30,084 more rows
library(ggplot2)

ap_sentiments%>% count(sentiment,term,wt=count)%>%ungroup()%>%filter(n>=200)%>% mutate(n=ifelse(sentiment=="negative",-n,n))%>%mutate(term=reorder(term,n))%>%ggplot(aes(term,n,fill=sentiment))+geom_bar(stat="identity")+ylab("Contribution to sentiment")+coord_flip()

##Tidying dfm Objects

Other text mining packages provide alternative implementations of DTMs, such as the dfm (document-feature matrix) class from the quanteda package

library(methods)
data("data_corpus_inaugural",package="quanteda")
inaug_dfm<- quanteda::dfm(data_corpus_inaugural,verbose=FALSE)

inaug_td<-tidy(inaug_dfm)

##Casting Tidy Text Data into a Matrix

Just as some existing text mining packages provide document-term matrices as sample data or output, some algorithms expect such matrices as input. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices

ap_td%>%cast_dtm(document,term,count)
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
library(Matrix)
m<- ap_td%>% cast_sparse(document,term,count)

This kind of conversion could easily be done from any of the tidy text structures we've used so far in this book. For example, we could create a DTM of Jane Austen's books in just a few lines of code

library(janeaustenr)
austen_dtm<- austen_books()%>%unnest_tokens(word,text)%>% count(book,word)%>%cast_dtm(book,word,n)

austen_dtm
## <<DocumentTermMatrix (documents: 6, terms: 14520)>>
## Non-/sparse entries: 40379/46741
## Sparsity           : 54%
## Maximal term length: 19
## Weighting          : term frequency (tf)
##Chapter 6

#Topic Modeling

In text mining, we often have collections of documents, such as blog posts or news articles, that we'd like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we're not sure what we're looking for.

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words


##Latent Dirichlet Allocation

Latent Dirichlet allocation is one of the most common algorithms for topic modeling.

library(topicmodels)
data("AssociatedPress")

AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

We can use the LDA() function from the topicmodels package, setting k=2, to create a two-topic LDA model

ap_lda<- LDA(AssociatedPress,k=2,control=list(seed=1234))
ap_lda
## A LDA_VEM topic model with 2 topics.

##Word-Topic Probabilities

In Chapter 5 we introduced the tidy() method, originally from the broom package, for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called beta, from the model

library(tidytext)

ap_topics<-tidy(ap_lda,matrix="beta")
ap_topics
## # A tibble: 20,946 x 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 aaron      1.69e-12
##  2     2 aaron      3.90e- 5
##  3     1 abandon    2.65e- 5
##  4     2 abandon    3.99e- 5
##  5     1 abandoned  1.39e- 4
##  6     2 abandoned  5.88e- 5
##  7     1 abandoning 2.45e-33
##  8     2 abandoning 2.34e- 5
##  9     1 abbott     2.13e- 6
## 10     2 abbott     2.97e- 5
## # ... with 20,936 more rows

Notice that this has turned the model into a one-topic-per-term-per-row format. Beta is the probability of that term being generated from that topic
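Since beta is a probability distribution over terms within each topic, the beta values for each topic should sum to (approximately) 1; a quick sanity check:

ap_topics %>% group_by(topic) %>% summarise(total_beta = sum(beta))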

library(ggplot2)
library(dplyr)

ap_top_terms<- ap_topics%>%group_by(topic)%>% top_n(10,beta)%>%ungroup()%>%arrange(topic,-beta)

ap_top_terms%>% mutate(term=reorder(term,beta))%>%ggplot(aes(term,beta,fill=factor(topic)))+geom_col(show.legend=FALSE)+facet_wrap(~topic,scales="free")+coord_flip()

This visualization lets us understand the two topics that were extracted from the articles. The most common words in topic 1 include "percent," "million," "billion," and "company," which suggests it may represent business or financial news. Those most common in topic 2 include "president," "government," and "soviet," suggesting that this topic represents political news

One important observation about the words in each topic is that some words, such as “new” and “people,” are common within both topics.

This is an advantage of topic modeling as opposed to "hard clustering" methods: topics used in natural language could have some overlap in terms of words. As an alternative, we could consider the terms that had the greatest difference in β between topic 1 and topic 2. This can be estimated based on the log ratio of the two: log2(β2/β1)

library(tidyr)
beta_spread<- ap_topics%>% mutate(topic=paste0("topic",topic))%>%spread(topic,beta)%>%filter(topic1>.001|topic2 > .001)%>%mutate(log_ratio=log2(topic2/topic1))

beta_spread
## # A tibble: 198 x 4
##    term              topic1      topic2 log_ratio
##    <chr>              <dbl>       <dbl>     <dbl>
##  1 administration 0.000431  0.00138         1.68 
##  2 ago            0.00107   0.000842       -0.339
##  3 agreement      0.000671  0.00104         0.630
##  4 aid            0.0000476 0.00105         4.46 
##  5 air            0.00214   0.000297       -2.85 
##  6 american       0.00203   0.00168        -0.270
##  7 analysts       0.00109   0.000000578   -10.9  
##  8 area           0.00137   0.000231       -2.57 
##  9 army           0.000262  0.00105         2.00 
## 10 asked          0.000189  0.00156         3.05 
## # ... with 188 more rows

##Document-Topic Probabilities

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics

ap_documents<-tidy(ap_lda,matrix="gamma")
ap_documents
## # A tibble: 4,492 x 3
##    document topic    gamma
##       <int> <int>    <dbl>
##  1        1     1 0.248   
##  2        2     1 0.362   
##  3        3     1 0.527   
##  4        4     1 0.357   
##  5        5     1 0.181   
##  6        6     1 0.000588
##  7        7     1 0.773   
##  8        8     1 0.00445 
##  9        9     1 0.967   
## 10       10     1 0.147   
## # ... with 4,482 more rows

Document 6 has a gamma of only about 0.0006 for topic 1, so it is drawn almost entirely from topic 2
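One way to check this is to look at the most common words in document 6, using the tidied document-term matrix:

tidy(AssociatedPress) %>% filter(document == 6) %>% arrange(desc(count))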

##By-Word Assignments: augment()

One step of the LDA algorithm is assigning each word in each document to a topic
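A minimal sketch of retrieving these assignments with augment(), for which tidytext provides an LDA method (it adds a .topic column to the original document-term counts):

assignments <- augment(ap_lda, data = AssociatedPress)
assignments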

##Summary

This chapter introduced topic modeling for finding clusters of words that characterize a set of documents, and showed how the tidy() verb lets us explore and understand these models using dplyr and ggplot2