#Load libraries
library(tidyverse)
library(tidytext)
library(tidyr)
library(dplyr)
library(widyr)
library(kableExtra)
library(textdata)
library(ggplot2)
library(wordcloud)
library(reshape2)
library(igraph)
library(grid)
library(tm)
library(Matrix)
library(topicmodels)
library(quanteda)
#Preface
##Outline
Chapter 1: Introduces the tidy text format and the unnest_tokens() function. It also introduces the gutenbergr and janeaustenr packages.
Chapter 2: Covers sentiment analysis using sentiment lexicon datasets.
Chapter 3: Covers the tf-idf statistic; the higher a term's idf value, the less frequently that term appears across documents.
Chapter 4: Covers n-grams and how to analyze word networks in text using the widyr and ggraph packages.
The chapters below cover how to convert back and forth between tidy and nontidy formats:
Chapter 5: Introduces methods for tidying document-term matrices and Corpus objects from the quanteda and tm packages.
Chapter 6: Explores the concept of topic modeling, and uses the tidy() method to interpret and visualize the output from the topicmodels package.
The practical case studies work with real-life datasets, such as metadata stored in JSON and Usenet messages from a diverse set of newsgroups (covering topics like politics, hockey, technology, atheism, and more), to understand patterns across the groups.
#The Tidy Text Format
Tidy data has a specific structure:
* Each variable is a column.
* Each observation is a row.
* Each type of observational unit is a table.
We define the tidy text format as a table with one token per row.
##Contrasting Tidy Text with Other Data Structures
As we stated above, the tidy text format is a table with one token per row. This is worth contrasting with the other ways text is often stored in text mining approaches:
- String: Text can, of course, be stored as strings (character vectors) within R.
- Corpus: These objects typically contain raw strings annotated with additional metadata and details.
- Document-term matrix: A sparse matrix describing a collection of documents, with one row for each document and one column for each term. The value in the matrix is typically a word count or tf-idf.
##The unnest_tokens function
Emily Dickinson wrote some lovely text in her time:
text<-c("Because I could not stop for Death-", "He kindly stopped for me-","The Carriage held but just ourselves-", "and Immortality")
library(dplyr)
text_df<- tibble(line=1:4, text=text)
text_df
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death-
## 2 2 He kindly stopped for me-
## 3 3 The Carriage held but just ourselves-
## 4 4 and Immortality
A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, does not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
library(tidytext)
text_df %>% unnest_tokens(word,text)
## # A tibble: 20 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 for
## 12 2 me
## 13 3 the
## 14 3 carriage
## 15 3 held
## 16 3 but
## 17 3 just
## 18 3 ourselves
## 19 4 and
## 20 4 immortality
#Tidying the Works of Jane Austen
library(janeaustenr)
library(dplyr)
library(stringr)
original_books<- austen_books()%>% group_by(book)%>% mutate(linenumber=row_number(),chapter=cumsum(str_detect(text,regex("^chapter [\\divxlc]",ignore_case = TRUE))))%>%ungroup()
original_books
## # A tibble: 73,422 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 SENSE AND SENSIBILITY Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 by Jane Austen Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 (1811) Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 CHAPTER 1 Sense & Sensibility 10 1
## # ... with 73,412 more rows
library(tidytext)
tidy_books<- original_books%>% unnest_tokens(word,text)
tidy_books
## # A tibble: 725,055 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # ... with 725,045 more rows
This function uses the tokenizers package to separate each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
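unnest_tokens() is not limited to words. As a minimal sketch (reusing the text_df tibble from above; the output column name sentence is our own choice), we could tokenize the same text into sentences instead:
text_df%>% unnest_tokens(sentence, text, token="sentences")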
We also want to remove stop words, kept in the tidytext dataset stop_words, which are words that are not useful for an analysis.
data(stop_words)
tidy_books<- tidy_books%>% anti_join(stop_words)
## Joining, by = "word"
We can use count() to find the most common words in all the books as a whole:
library(ggplot2)
tidy_books%>% count(word,sort=TRUE)%>%filter(n>600)%>% mutate(word=reorder(word,n))%>% ggplot(aes(word,n))+geom_col()+xlab(NULL)+coord_flip()
#The gutenbergr Package
The gutenbergr package provides access to public domain works from the Project Gutenberg collection.
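Before downloading, you can look up Gutenberg IDs with gutenberg_works(), which filters the package's metadata table. A small sketch (the exact author string is an assumption about how H.G. Wells is recorded in gutenberg_metadata):
library(gutenbergr)
gutenberg_works(author=="Wells, H. G. (Herbert George)")%>% select(gutenberg_id,title)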
##Word Frequencies
A common task in text mining is to look at word frequencies, just as we have done above for Jane Austen's novels, and to compare frequencies across different texts.
library(gutenbergr)
hgwells<- gutenberg_download(c(35,36,5230,159))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_hgwells<- hgwells%>% unnest_tokens(word,text)%>%anti_join(stop_words)
## Joining, by = "word"
tidy_hgwells%>% count(word,sort=TRUE)
## # A tibble: 11,769 x 2
## word n
## <chr> <int>
## 1 time 454
## 2 people 302
## 3 door 260
## 4 heard 249
## 5 black 232
## 6 stood 229
## 7 white 222
## 8 hand 218
## 9 kemp 213
## 10 eyes 210
## # ... with 11,759 more rows
bronte<- gutenberg_download(c(1260,768,969,9182,767))
tidy_bronte<- bronte %>% unnest_tokens(word,text)%>% anti_join(stop_words)
## Joining, by = "word"
Now, let's calculate the frequency for each word in the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together.
library(tidyr)
frequency <-
bind_rows(
mutate(tidy_bronte, author ="Bronte_Sister"),
mutate(tidy_hgwells, author ="H.G_Wells"),
mutate(tidy_books, author = "Jane_Austen")
) %>% mutate(word = str_extract(word, "[a-z']+")) %>% count(author, word) %>% group_by(author) %>% mutate(proportion =
n / sum(n)) %>% select(-n) %>% spread(author, proportion) %>% gather(author, proportion, Bronte_Sister:H.G_Wells)
library(scales)
ggplot(frequency, aes(
x = proportion,
y = Jane_Austen,
color = abs(Jane_Austen - proportion)
)) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(
alpha = 0.1,
size = 2.5,
width = 0.3,
height = 0.3
) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001),
low = "darkslategray4",
high = "gray75") +
facet_wrap( ~ author, ncol = 2) +
theme(legend.position = "none") +
labs(y = "Jane Austen", x = NULL)
## Warning: Removed 41357 rows containing missing values (geom_point).
## Warning: Removed 41359 rows containing missing values (geom_text).
###Correlation test
cor.test(data = frequency[frequency$author == "Bronte_Sister",],
~ proportion + Jane_Austen)
##
## Pearson's product-moment correlation
##
## data: proportion and Jane_Austen
## t = 119.65, df = 10404, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7527869 0.7689642
## sample estimates:
## cor
## 0.7609938
cor.test(data = frequency[frequency$author == "H.G_Wells",],
~ proportion + Jane_Austen)
##
## Pearson's product-moment correlation
##
## data: proportion and Jane_Austen
## t = 36.441, df = 6053, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4032800 0.4445987
## sample estimates:
## cor
## 0.4241601
##Summary
In this chapter, we explored what we mean by tidy data when it comes to text, and how tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tools ecosystem.
##Chapter 2
#Sentiment Analysis with Tidy Data
In the previous chapter we explored in depth what we mean by the tidy text format and showed how this format can be used to approach questions about word frequency.
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words.
##The sentiments Dataset
As discussed above, there are a variety of methods and dictionaries for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons.
library(tidytext)
sentiments
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
The three general-purpose lexicons are AFINN, bing, and nrc:
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
#Sentiment Analysis with Inner Join
library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books<- austen_books()%>%group_by(book)%>% mutate(linenumber=row_number(),chapter=cumsum(str_detect(text, regex("^chapter [\\divxlc]",ignore_case=TRUE))))%>% ungroup()%>% unnest_tokens(word,text)
#Join with sentiments datasets
nrcjoy<- get_sentiments("nrc")%>% filter(sentiment=="joy")
tidy_books%>% filter(book=="Emma")%>% inner_join(nrcjoy)%>%count(word,sort=TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
We can visualize the emotional trajectory of each novel by splitting the text into blocks of 80 lines and, for each block, counting the number of positive words minus the number of negative words.
library(tidyr)
janeaustensentiment<- tidy_books%>% inner_join(get_sentiments("bing"))%>% count(book, index=linenumber %/% 80, sentiment)%>% spread(sentiment,n,fill=0)%>% mutate(sentiment=positive-negative)
## Joining, by = "word"
library(ggplot2)
ggplot(janeaustensentiment,aes(index,sentiment,fill=book))+geom_col(show.legend = FALSE)+facet_wrap(~book,ncol=2,scales="free_x")
##Most Common Positive and Negative Words
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment. By calling count() with both word and sentiment as arguments, we find out how much each word contributed to each sentiment.
bing_word_counts<-tidy_books%>% inner_join(get_sentiments("bing"))%>% count(word,sentiment,sort=TRUE)%>% ungroup()
## Joining, by = "word"
bing_word_counts%>% group_by(sentiment)%>% top_n(10)%>% ungroup()%>%mutate(word=reorder(word,n))%>% ggplot(aes(word,n,fill=sentiment))+geom_col(show.legend = FALSE)+facet_wrap(~sentiment,scales="free_y")+labs(y="Contribution to sentiment",x=NULL)+coord_flip()
## Selecting by n
##Wordclouds
We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in tidy format is useful for other plots as well
For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud.
library(wordcloud)
tidy_books%>% anti_join(stop_words)%>% count(word)%>%with(wordcloud(word,n,max.words=100))
## Joining, by = "word"
#wordcloud() expects vectors rather than a data frame, so we use with() to evaluate the call using the columns of our data frame
Other functions, such as comparison.cloud(), may require the data frame to be turned into a matrix first; acast() from the reshape2 package handles this.
With comparison.cloud(), the word cloud is split into two halves based on Bing sentiment (positive versus negative).
library(reshape2)
tidy_books%>% inner_join(get_sentiments("bing"))%>% count(word,sentiment,sort=TRUE)%>%acast(word~sentiment,value.var = "n",fill = 0)%>% comparison.cloud(colors=c("gray20","gray80"),max.words = 100)
## Joining, by = "word"
##Looking at Units Beyond Just Words
Instead of the default word tokenization, we can tokenize into other units, such as sentences or n-grams.
Here are some examples that use a regex to split the text.
Extract and count the chapters in each book:
austen_chapter<- austen_books()%>% group_by(book)%>% unnest_tokens(chapter,text,token="regex",pattern="Chapter|CHAPTER [\\dIVXLC]")%>% ungroup()
austen_chapter%>% group_by(book)%>% summarise(chapters=n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
We can also use the ratio of negative words to total words to find the most negative chapter in each book:
bingnegative<- get_sentiments("bing")%>% filter(sentiment=="negative")
wordcounts<- tidy_books%>%group_by(book,chapter)%>%summarise(words=n())
tidy_books%>%semi_join(bingnegative)%>%group_by(book,chapter)%>% summarise(negativewords=n())%>% left_join(wordcounts,by=c("book","chapter"))%>%mutate(ratio=negativewords/words)%>%filter(chapter!=0)%>%top_n(1)%>%ungroup()
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
In Chapter 43 of Sense and Sensibility, Marianne is seriously ill, near death; and in Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!). Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry’s scandalous adultery; Chapter 15 of Emma is when horrifying Mr. Elton proposes; and in Chapter 21 of Northanger Abbey, Catherine is deep in her Gothic faux fantasy of murder. Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth, how sad she was, and what a terrible mistake she realized it to be.
##Chapter 3
#Analyzing Word and Document Frequency: tf-idf
A central question in text mining and natural language processing is how to quantify what a document is about.
One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document
There are many words that occur frequently in a document but may not be important, such as "the", "of", and "is". We might take the approach of adding such words to a list of stop words and removing them before analysis, but some of these words can matter more in some documents than in others, so a blanket stop word list is a fairly crude adjustment.
Another approach is to look at a term's inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that appear in relatively few documents:
idf(term) = ln( n_documents / n_documents containing the term )
From this formula we can see that the higher a term's idf value, the less frequently that term appears across the documents in the collection.
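As a toy sketch of this formula in R (the counts are made up for illustration, not taken from a real corpus):
n_documents<- 6      #total documents in the corpus (assumed)
n_containing<- 2     #documents that contain the term (assumed)
log(n_documents/n_containing)   #about 1.10; a term appearing in every document gets idf = 0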
##Term Frequency in Jane Austen's Novels
Let us start by looking at the published novels of Jane Austen and first examine term frequency, then tf-idf.
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words<- austen_books()%>% unnest_tokens(word,text)%>% count(book, word,sort=TRUE)%>%ungroup()
total_words<- book_words%>% group_by(book)%>% summarise(total=sum(n))
book_words<- left_join(book_words,total_words)
## Joining, by = "book"
Unsurprisingly, the words with the highest counts are common words such as "the", "and", and "to".
Next we plot n divided by total for each novel: the number of times a word appears divided by the total number of words in that novel. This is exactly term frequency.
library(ggplot2)
ggplot(book_words,aes(n/total,fill=book))+geom_histogram(show.legend = FALSE)+xlim(NA,0.0009)+facet_wrap(~book,ncol=2,scales="free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 896 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).
##Zipf’s Law
Zipf's law states that the frequency with which a word appears is inversely proportional to its rank: frequency ∝ 1/rank.
freq_by_rank<- book_words%>% group_by(book)%>% mutate(rank=row_number(),`term frequency`=n/total)
freq_by_rank%>%ggplot(aes(rank,`term frequency`, color=book))+geom_line(size=1.1,alpha=0.8,show.legend = FALSE)+scale_x_log10()+scale_y_log10()
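Zipf's law implies a roughly constant negative slope on this log-log plot. Here is a brief sketch of estimating that exponent with a linear fit (the rank cutoffs of 10 and 500 are arbitrary choices to stay on the middle, roughly linear section of the curve):
rank_subset<- freq_by_rank%>% filter(rank<500, rank>10)
lm(log10(`term frequency`)~log10(rank), data=rank_subset)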
##The bind_tf_idf Function
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that are not used very much in a collection or corpus of documents.
book_words<- book_words %>% bind_tf_idf(word,book,n)
book_words%>% arrange(desc(tf_idf))%>%mutate(word=factor(word,levels=rev(unique(word))))%>%group_by(book)%>%top_n(15)%>%ungroup()%>%ggplot(aes(word,tf_idf,fill=book))+geom_col(show.legend = FALSE)+labs(x=NULL,y="tf-idf")+facet_wrap(~book,ncol=2,scales="free")+coord_flip()
## Selecting by tf_idf
#This plot shows that the words with the highest tf-idf in each novel are uncommon, novel-specific words (mostly proper nouns)
##A Corpus of Physics Texts
Let us work with another corpus of documents to see what terms are important in a different set of works.
library(gutenbergr)
physics<- gutenberg_download(c(37729,14725,13476,5001),meta_fields="author")
physics_words<- physics%>% unnest_tokens(word,text)%>% count(author,word,sort=TRUE)%>%ungroup()
plot_physics<- physics_words%>% bind_tf_idf(word,author,n)%>%arrange(desc(tf_idf))%>%mutate(word=factor(word,levels = rev(unique(word))))%>% mutate(author=factor(author, levels = c("Galilei, Galileo",
"Huygens, Christiaan",
"Tesla, Nikola",
"Einstein, Albert")))
plot_physics%>% group_by(author)%>%top_n(15,tf_idf)%>%ungroup()%>%mutate(word=reorder(word,tf_idf))%>%ggplot(aes(word,tf_idf,fill=author))+geom_col(show.legend = FALSE)+labs(x=NULL,y="tf-idf")+facet_wrap(~author,ncol=2,scales="free")+coord_flip()
#Relationships Between Words: N-grams and Correlations
So far, we've considered words as individual units, and considered their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words: for example, examining which words tend to follow others immediately.
##Tokenizing by N-gram
We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentences
We do this by adding the token=“ngrams” option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of 2 consecutive words, often called “bigrams”:
library(dplyr)
library(tidytext)
library(janeaustenr)
austen_bigrams<-austen_books()%>% unnest_tokens(bigram,text,token="ngrams",n=2)
##Counting and Filtering N-grams
Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr's count():
austen_bigrams%>%count(bigram,sort=TRUE)
## # A tibble: 211,236 x 2
## bigram n
## <chr> <int>
## 1 of the 3017
## 2 to be 2787
## 3 in the 2368
## 4 it was 1781
## 5 i am 1545
## 6 she had 1472
## 7 of her 1445
## 8 to the 1387
## 9 she was 1377
## 10 had been 1299
## # ... with 211,226 more rows
A lot of the most common bigrams are pairs of common (uninteresting) words, such as "of the" and "to be": what we call "stop words".
It is useful to separate() the bigram into two words, and then remove rows where either word appears in the stop words list.
library(tidyr)
bigrams_separated<- austen_bigrams%>%separate(bigram,c("word1","word2"),sep=" ")
bigrams_filtered<- bigrams_separated%>% filter(!word1 %in% stop_words$word)%>%filter(!word2 %in% stop_words$word)
bigram_count<- bigrams_filtered%>%count(word1,word2,sort=TRUE)
bigram_count
## # A tibble: 33,421 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 sir thomas 287
## 2 miss crawford 215
## 3 captain wentworth 170
## 4 miss woodhouse 162
## 5 frank churchill 132
## 6 lady russell 118
## 7 lady bertram 114
## 8 sir walter 113
## 9 miss fairfax 109
## 10 colonel brandon 108
## # ... with 33,411 more rows
We can see that names (whether first and last, or with a salutation) are the most common pairs in Jane Austen's books.
Now we can use unite() to recombine the two columns into meaningful bigrams:
bigrams_united<- bigrams_filtered%>% unite(bigram,word1,word2,sep=" ")
bigrams_united
## # A tibble: 44,784 x 2
## book bigram
## <fct> <chr>
## 1 Sense & Sensibility jane austen
## 2 Sense & Sensibility austen 1811
## 3 Sense & Sensibility 1811 chapter
## 4 Sense & Sensibility chapter 1
## 5 Sense & Sensibility norland park
## 6 Sense & Sensibility surrounding acquaintance
## 7 Sense & Sensibility late owner
## 8 Sense & Sensibility advanced age
## 9 Sense & Sensibility constant companion
## 10 Sense & Sensibility happened ten
## # ... with 44,774 more rows
In other analyses, you may be interested in the most common trigrams, which are consecutive sequences of three words:
austen_books()%>% unnest_tokens(trigram,text,token="ngrams",n=3)%>% separate(trigram,c("word1","word2","word3"),sep=" ")%>% filter(!word1 %in% stop_words$word,!word2 %in% stop_words$word,!word3 %in% stop_words$word)%>%count(word1,word2,word3,sort=TRUE)
## # A tibble: 8,757 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 dear miss woodhouse 23
## 2 miss de bourgh 18
## 3 lady catherine de 14
## 4 catherine de bourgh 13
## 5 poor miss taylor 11
## 6 sir walter elliot 11
## 7 ten thousand pounds 11
## 8 dear sir thomas 10
## 9 twenty thousand pounds 8
## 10 replied miss crawford 7
## # ... with 8,747 more rows
##Analyzing Bigrams
This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common “streets” mentioned in each book.
bigrams_filtered %>% filter(word2=="street")%>%count(book,word1,sort=TRUE)
## # A tibble: 34 x 3
## book word1 n
## <fct> <chr> <int>
## 1 Sense & Sensibility berkeley 16
## 2 Sense & Sensibility harley 16
## 3 Northanger Abbey pulteney 14
## 4 Northanger Abbey milsom 11
## 5 Mansfield Park wimpole 10
## 6 Pride & Prejudice gracechurch 9
## 7 Sense & Sensibility conduit 6
## 8 Sense & Sensibility bond 5
## 9 Persuasion milsom 5
## 10 Persuasion rivers 4
## # ... with 24 more rows
bigram_tf_idf<- bigrams_united %>% count(book,bigram)%>% bind_tf_idf(bigram,book,n)%>% arrange(desc(tf_idf))
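These bigram tf-idf values can be visualized per book in much the same way as the word-level tf-idf plot earlier; here is a sketch (the cutoff of 12 bigrams per book is an arbitrary choice):
bigram_tf_idf%>% group_by(book)%>% top_n(12,tf_idf)%>% ungroup()%>% mutate(bigram=reorder(bigram,tf_idf))%>% ggplot(aes(bigram,tf_idf,fill=book))+geom_col(show.legend=FALSE)+labs(x=NULL,y="tf-idf of bigram")+facet_wrap(~book,ncol=2,scales="free")+coord_flip()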
##Using Bigrams to Provide Context in Sentiment Analysis
Our sentiment analysis approach in Chapter 2 simply counted the appearance of positive or negative words, according to a reference lexicon.
One of the problems with this approach is that a word's context can matter nearly as much as its presence. For example, the words "happy" and "like" will be counted as positive, even in sentences like "I'm not happy" or "I don't like it".
bigrams_separated %>% filter(word1=="not")%>%count(word1,word2,sort=TRUE)
## # A tibble: 1,246 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 not be 610
## 2 not to 355
## 3 not have 327
## 4 not know 252
## 5 not a 189
## 6 not think 176
## 7 not been 160
## 8 not the 147
## 9 not at 129
## 10 not in 118
## # ... with 1,236 more rows
To examine this more quantitatively, we can use the AFINN lexicon, which gives a numeric sentiment value for each word:
AFINN<- get_sentiments("afinn")
not_words<- bigrams_separated%>% filter(word1=="not")%>% inner_join(AFINN,by=c(word2="word"))%>% count(word2,score=value,sort=TRUE)%>% ungroup()
not_words%>% mutate(contribution=n*score)%>%arrange(desc(abs(contribution)))%>% head(20)%>% mutate(word2=reorder(word2,contribution))%>%ggplot(aes(word2,n*score,fill=n*score>0))+geom_col(show.legend = FALSE)+xlab("Words preceded by \"not\"")+ylab("Sentiment score * number of occurrences")+coord_flip()
The bigrams "not like" and "not help" were overwhelmingly the largest causes of misidentification, making the text seem more positive than it really is.
Now let us look at a few more negation words at once: "not", "no", "never", and "without".
negation_words<- c("not","no","never","without")
negated_words<- bigrams_separated %>% filter(word1 %in% negation_words)%>% inner_join(AFINN, by=c(word2="word"))%>% count(word1,word2,score=value,sort=TRUE)%>% ungroup()
negated_words%>% ggplot(aes(x=word2,y=n*score,fill=n*score))+geom_col()+facet_wrap(~word1)
##Visualizing a Network of Bigrams with ggraph
We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or "graph". Here we'll be referring to a graph not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:
- from: the node an edge is coming from
- to: the node an edge is going toward
- weight: a numeric value associated with each edge
library(igraph)
bigram_graph<- bigram_count%>% filter(n>20)%>%graph_from_data_frame()
graph_from_data_frame() takes a data frame of edges, with columns for the nodes the edges come from and go to (plus edge attributes, here n), and builds an igraph object that we can then add layers to.
library(ggraph)
set.seed(2017)
ggraph(bigram_graph,layout="fr")+geom_edge_link()+geom_node_point()+geom_node_text(aes(label=name),vjust=1,hjust=1)
We conclude with a few polishing operations to make a better-looking graph:
- We add the edge_alpha aesthetic to the link layer to make links transparent based on how common or rare the bigram is.
- We add directionality with an arrow, constructed using grid::arrow(), including an end_cap option that tells the arrow to end before touching the node.
- We tinker with the options to the node layer to make the nodes more attractive.
- We add a theme that's useful for plotting networks, theme_void().
set.seed(2016)
a<- grid::arrow(type="closed",length=unit(.15,"inches"))
ggraph(bigram_graph,layout="fr")+geom_edge_link(aes(edge_alpha=n),show.legend=FALSE,arrow=a,end_cap=circle(.07,"inches"))+geom_node_point(color="lightblue",size=5)+geom_node_text(aes(label=name),vjust=1,hjust=1)+theme_void()
##Counting and Correlating Pairs of Words with the widyr Package
Tokenizing by n-gram is a useful way to explore pairs of adjacent words. However, we may also be interested in words that tend to co-occur within particular documents or particular chapters, even if they don't occur next to each other.
The widyr package makes this easy: it lets us count or correlate pairs of words within a chosen unit of text (for example, a ten-line section), by widening the data, performing an operation such as a count or correlation, and then re-tidying the result.
##Counting and Correlating Among Sections
austen_section_words<- austen_books()%>%filter(book=="Pride & Prejudice")%>% mutate(section=row_number()%/%10)%>% filter(section>0)%>% unnest_tokens(word,text)%>%filter(!word %in% stop_words$word)
library(widyr)
word_pairs<- austen_section_words%>% pairwise_count(word,section,sort=TRUE)
"Choose the word darcy"
## [1] "Choose the word darcy"
word_pairs%>% filter(item1=="darcy")
Pairwise counts alone tend to favor the most common individual words. Instead, we can use pairwise_cor() from widyr to find the correlation between words, based on how often they appear in the same section relative to how often they appear separately (the phi coefficient). Here we first restrict to relatively common words:
word_cors<- austen_section_words%>% group_by(word)%>%filter(n()>=20)%>%pairwise_cor(word,section,sort=TRUE)
word_cors
## # A tibble: 154,842 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 bourgh de 0.951
## 2 de bourgh 0.951
## 3 pounds thousand 0.701
## 4 thousand pounds 0.701
## 5 william sir 0.664
## 6 sir william 0.664
## 7 catherine lady 0.663
## 8 lady catherine 0.663
## 9 forster colonel 0.622
## 10 colonel forster 0.622
## # ... with 154,832 more rows
word_cors%>% filter(item1=="pounds")
## # A tibble: 393 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 pounds thousand 0.701
## 2 pounds ten 0.231
## 3 pounds fortune 0.164
## 4 pounds settled 0.149
## 5 pounds wickham's 0.142
## 6 pounds children 0.129
## 7 pounds mother's 0.119
## 8 pounds believed 0.0932
## 9 pounds estate 0.0890
## 10 pounds ready 0.0860
## # ... with 383 more rows
set.seed(2016)
a<- grid::arrow(type="closed",length=unit(.15,"inches"))
word_cors%>%filter(correlation>.15)%>% graph_from_data_frame()%>% ggraph(layout="fr")+geom_edge_link(aes(edge_alpha=correlation),show.legend=FALSE,arrow=a,end_cap=circle(.07,"inches"))+geom_node_point(color="lightblue",size=5)+geom_node_text(aes(label=name),repel=TRUE)+theme_void()
##Summary
This chapter showed how the tidy text approach is useful not only for analyzing individual words, but also for exploring the relationships and connections between words
##Chapter 5
#Converting to and from Nontidy Formats
In the previous chapters, we've been analyzing text arranged in the tidy text format: a table with one token per document per row, such as is constructed by the unnest_tokens() function. This lets us use the popular suite of tidy tools such as dplyr, tidyr, and ggplot2 to explore and visualize text data.
##Tidying a Document-Term matrix
One of the most common structures that text mining packages work with is the document-term matrix (DTM). This is a matrix where:
- Each row represents one document.
- Each column represents one term.
- Each value (typically) contains the number of appearances of that term in that document.
Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns).
DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages. Thus, the tidytext package provides two verbs that convert between the two formats:
- tidy() turns a document-term matrix into a tidy data frame.
- cast_() verbs (cast_dtm(), cast_dfm(), cast_sparse()) turn a tidy data frame into a matrix.
##Tidying DocumentTermMatrix Objects
Perhaps the most widely used implementation of DTMs in R is the DocumentTermMatrix class in the tm package
library(tm)
data("AssociatedPress")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
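Because a DTM behaves like a matrix, we can index particular documents and terms directly; a quick sketch using tm's inspect() to look at a small slice:
inspect(AssociatedPress[1:2,1:6])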
"To extract the term from DTM. Terms function is used for this case"
## [1] "To extract the term from DTM. Terms function is used for this case"
terms<- Terms(AssociatedPress)
#head() shows the first few terms in the vector
head(terms)
## [1] "aaron" "abandon" "abandoned" "abandoning" "abbott"
## [6] "abboud"
library(dplyr)
library(tidytext)
ap_td<- tidy(AssociatedPress)
ap_td
## # A tibble: 302,031 x 3
## document term count
## <int> <chr> <dbl>
## 1 1 adding 1
## 2 1 adult 2
## 3 1 ago 1
## 4 1 alcohol 1
## 5 1 allegedly 1
## 6 1 allen 1
## 7 1 apparently 2
## 8 1 appeared 1
## 9 1 arrested 1
## 10 1 assault 1
## # ... with 302,021 more rows
Notice that only the nonzero values are included in the tidied output: document 1 includes terms such as "adding" and "adult", but not "aaron" or "abandon". This means the tidied version has no rows where count is zero.
ap_sentiments<- ap_td%>%inner_join(get_sentiments("bing"),by=c(term="word"))
ap_sentiments
## # A tibble: 30,094 x 4
## document term count sentiment
## <int> <chr> <dbl> <chr>
## 1 1 assault 1 negative
## 2 1 complex 1 negative
## 3 1 death 1 negative
## 4 1 died 1 negative
## 5 1 good 2 positive
## 6 1 illness 1 negative
## 7 1 killed 2 negative
## 8 1 like 2 positive
## 9 1 liked 1 positive
## 10 1 miracle 1 positive
## # ... with 30,084 more rows
library(ggplot2)
ap_sentiments%>% count(sentiment,term,wt=count)%>%ungroup()%>%filter(n>=200)%>% mutate(n=ifelse(sentiment=="negative",-n,n))%>%mutate(term=reorder(term,n))%>%ggplot(aes(term,n,fill=sentiment))+geom_bar(stat="identity")+ylab("Contribution to sentiment")+coord_flip()
##Tidying dfm Objects
Other text mining packages provide alternative implementations of DTMs, such as the dfm (document-feature matrix) class from the quanteda package.
library(methods)
data("data_corpus_inaugural",package="quanteda")
inaug_dfm<- quanteda::dfm(data_corpus_inaugural,verbose=FALSE)
inaug_td<-tidy(inaug_dfm)
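Once tidied, the usual tidy tools apply to the inaugural addresses. For example, here is a brief sketch of finding the words most specific to each speech with bind_tf_idf() (the object name inaug_tf_idf is our own choice):
inaug_tf_idf<- inaug_td%>% bind_tf_idf(term,document,count)%>% arrange(desc(tf_idf))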
##Casting Tidy Text Data into a Matrix
Just as some existing text mining packages provide document-term matrices as sample data or output, some algorithms expect such matrices as input. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.
ap_td%>%cast_dtm(document,term,count)
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
library(Matrix)
m<- ap_td%>% cast_sparse(document,term,count)
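A quick sketch to confirm what cast_sparse() returns: a sparse matrix from the Matrix package, with one row per document and one column per term.
class(m)
dim(m)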
This kind of conversion could easily be done from any of the tidy text structures we've used so far in this book. For example, we could create a DTM of Jane Austen's books in just a few lines of code:
library(janeaustenr)
austen_dtm<- austen_books()%>%unnest_tokens(word,text)%>% count(book,word)%>%cast_dtm(book,word,n)
austen_dtm
## <<DocumentTermMatrix (documents: 6, terms: 14520)>>
## Non-/sparse entries: 40379/46741
## Sparsity : 54%
## Maximal term length: 19
## Weighting : term frequency (tf)
##Chapter 6
#Topic Modeling
In text mining, we often have collections of documents, such as blog posts or news articles, that we'd like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we're not sure what we're looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words.
##Latent Dirichlet Allocation
Latent Dirichlet allocation is one of the most common algorithms for topic modeling. It is guided by two principles:
- Every document is a mixture of topics. For example, Document 1 might be 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.
- Every topic is a mixture of words. For example, words like "president", "Congress", and "government" might all belong to a topic about politics.
library(topicmodels)
data("AssociatedPress")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
We can use the LDA() function from the topicmodels package, setting k = 2, to create a two-topic LDA model.
ap_lda<- LDA(AssociatedPress,k=2,control=list(seed=1234))
ap_lda
## A LDA_VEM topic model with 2 topics.
##Word-Topic Probabilities
In Chapter 5 we introduced the tidy() method, originally from the broom package, for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called β ("beta"), from the model.
library(tidytext)
ap_topics<-tidy(ap_lda,matrix="beta")
ap_topics
## # A tibble: 20,946 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 aaron 1.69e-12
## 2 2 aaron 3.90e- 5
## 3 1 abandon 2.65e- 5
## 4 2 abandon 3.99e- 5
## 5 1 abandoned 1.39e- 4
## 6 2 abandoned 5.88e- 5
## 7 1 abandoning 2.45e-33
## 8 2 abandoning 2.34e- 5
## 9 1 abbott 2.13e- 6
## 10 2 abbott 2.97e- 5
## # ... with 20,936 more rows
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, beta is the probability of that term being generated from that topic.
library(ggplot2)
library(dplyr)
ap_top_terms<- ap_topics%>%group_by(topic)%>% top_n(10,beta)%>%ungroup()%>%arrange(topic,-beta)
ap_top_terms%>% mutate(term=reorder(term,beta))%>%ggplot(aes(term,beta,fill=factor(topic)))+geom_col(show.legend=FALSE)+facet_wrap(~topic,scales="free")+coord_flip()
This visualization lets us understand the two topics that were extracted from the articles. The most common words in topic 1 include "percent," "million," "billion," and "company," which suggests it may represent business or financial news. Those most common in topic 2 include "president," "government," and "soviet," suggesting that this topic represents political news.
One important observation about the words in each topic is that some words, such as “new” and “people,” are common within both topics.
This is an advantage of topic modeling as opposed to "hard clustering" methods: topics used in natural language can have some overlap in terms of words. As an alternative, we could consider the terms that had the greatest difference in β between topic 1 and topic 2. This can be estimated based on the log ratio of the two, log2(β2/β1); a log ratio makes the difference symmetrical, so a value of 1 means β2 is twice as large and −1 means β1 is twice as large.
library(tidyr)
beta_spread<- ap_topics%>% mutate(topic=paste0("topic",topic))%>%spread(topic,beta)%>%filter(topic1>.001|topic2 > .001)%>%mutate(log_ratio=log2(topic2/topic1))
beta_spread
## # A tibble: 198 x 4
## term topic1 topic2 log_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 administration 0.000431 0.00138 1.68
## 2 ago 0.00107 0.000842 -0.339
## 3 agreement 0.000671 0.00104 0.630
## 4 aid 0.0000476 0.00105 4.46
## 5 air 0.00214 0.000297 -2.85
## 6 american 0.00203 0.00168 -0.270
## 7 analysts 0.00109 0.000000578 -10.9
## 8 area 0.00137 0.000231 -2.57
## 9 army 0.000262 0.00105 2.00
## 10 asked 0.000189 0.00156 3.05
## # ... with 188 more rows
##Document-Topic Probabilities
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics
ap_documents<-tidy(ap_lda,matrix="gamma")
ap_documents
## # A tibble: 4,492 x 3
## document topic gamma
## <int> <int> <dbl>
## 1 1 1 0.248
## 2 2 1 0.362
## 3 3 1 0.527
## 4 4 1 0.357
## 5 5 1 0.181
## 6 6 1 0.000588
## 7 7 1 0.773
## 8 8 1 0.00445
## 9 9 1 0.967
## 10 10 1 0.147
## # ... with 4,482 more rows
Notice that document 6 has a gamma for topic 1 of only about 0.0006, so it is drawn almost entirely from topic 2.
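We can check why by looking at the most common words in that document; a quick sketch using the tidied DTM:
tidy(AssociatedPress)%>% filter(document==6)%>% arrange(desc(count))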
##By-Word Assignments: augment
One step of the LDA algorithm is assigning each word in each document to a topic. The augment() verb lets us examine these per-word assignments.
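A minimal sketch of augment() on our model (the object name assignments is our own choice); it returns one row per document-term pair, with the assigned topic in a .topic column:
assignments<- augment(ap_lda, data=AssociatedPress)
assignments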
##Summary
This chapter introduced topic modeling for finding clusters of words that characterize a set of documents, and showed how the tidy() verb lets us explore and understand these models using dplyr and ggplot2.