Instructions

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.

I will analyze the sentiment of the Harry Potter series written by J. K. Rowling. We want to know how the words in each chapter are associated with positive or negative feelings according to different lexicons: Bing, NRC, AFINN, and variants built on the Loughran lexicon.

The three general-purpose lexicons are:
• AFINN from Finn Årup Nielsen
• Bing from Bing Liu and collaborators
• NRC from Saif Mohammad and Peter Turney

Sentiment analysis with tidy data

Word and document analysis with tf-idf

Relation between words

Converting to and from non-tidy formats

#Topic modeling

#Load libraries

library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.2
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.4.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.4.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.2
## Loading required package: RColorBrewer
library(reshape2)
library(harrypotter)
## Warning: package 'harrypotter' was built under R version 4.4.2
library(httr)  
library(readtext)
## Warning: package 'readtext' was built under R version 4.4.2
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.2
## 
## Attaching package: 'textdata'
## The following object is masked from 'package:httr':
## 
##     cache_info
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.4.2
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.4.2
library(devtools)
## Warning: package 'devtools' was built under R version 4.4.2
## Loading required package: usethis
library(tibble)
## Warning: package 'tibble' was built under R version 4.4.2
library(sos)
## Warning: package 'sos' was built under R version 4.4.2
## Loading required package: brew
## 
## Attaching package: 'sos'
## The following object is masked from 'package:tidyr':
## 
##     matches
## The following object is masked from 'package:dplyr':
## 
##     matches
## The following object is masked from 'package:utils':
## 
##     ?

The sentiments dataset

sentiments
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows

Get the AFINN sentiment lexicon

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows

Get the Bing sentiment lexicon

get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows

Get the NRC sentiment lexicon

get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows
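
The three lexicons use different scales: AFINN gives each word a numeric value, Bing gives a positive or negative label, and NRC lists one row per word and category. A minimal sketch of how they could be reduced to a common positive/negative scale is shown here. The small tibbles are toy stand-ins with made-up entries, not the real lexicons, and the object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the real lexicons (entries are illustrative assumptions)
afinn_toy <- tibble(word = c("happy", "abandon", "good"), value = c(3, -2, 3))
bing_toy  <- tibble(word = c("happy", "abandon"),
                    sentiment = c("positive", "negative"))
nrc_toy   <- tibble(word = c("happy", "happy", "abandon", "abandon"),
                    sentiment = c("joy", "positive", "fear", "negative"))

# AFINN: the sign of the numeric value gives the polarity
afinn_polarity <- afinn_toy %>%
  filter(value != 0) %>%
  transmute(word,
            polarity = if_else(value > 0, "positive", "negative"),
            method = "AFINN")

# Bing: the label already is the polarity
bing_polarity <- bing_toy %>%
  transmute(word, polarity = sentiment, method = "Bing")

# NRC: keep only the positive/negative rows and drop the emotion categories
nrc_polarity <- nrc_toy %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  transmute(word, polarity = sentiment, method = "NRC")

lexicon_polarity <- bind_rows(afinn_polarity, bing_polarity, nrc_polarity)
lexicon_polarity
```

With the real lexicons, the same three steps would apply to the results of get_sentiments("afinn"), get_sentiments("bing"), and get_sentiments("nrc").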

Sentiment Analysis with group_by

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 × 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ℹ 725,045 more rows
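
The chapter column comes from a running count of lines that look like chapter headings. The sketch below applies the same regular expression to a few toy lines (the lines and the name toy_lines are made up for illustration) to show why the front matter ends up in chapter 0.

```r
library(stringr)

# Toy lines standing in for the text column of austen_books()
toy_lines <- c("SENSE AND SENSIBILITY",
               "",
               "CHAPTER 1",
               "The family of Dashwood had long been settled in Sussex.",
               "CHAPTER II",
               "Some later text.")

# TRUE on heading lines: "chapter" followed by a digit or a roman numeral letter
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# The cumulative sum increases by one at every heading
chapter <- cumsum(is_heading)
chapter
```

Lines before the first heading get 0, and every later line carries the number of the most recent heading.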

Jane Austen's best-known books.

 austen_books() %>%
  distinct(book)
## # A tibble: 6 × 1
##   book               
##   <fct>              
## 1 Sense & Sensibility
## 2 Pride & Prejudice  
## 3 Mansfield Park     
## 4 Emma               
## 5 Northanger Abbey   
## 6 Persuasion

Number the lines within each book using the tidyverse functions group_by() and row_number()

original_books <- austen_books()%>%
  group_by(book)%>%
  mutate(line= row_number())%>%
  ungroup()
original_books
## # A tibble: 73,422 × 3
##    text                    book                 line
##    <chr>                   <fct>               <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility     1
##  2 ""                      Sense & Sensibility     2
##  3 "by Jane Austen"        Sense & Sensibility     3
##  4 ""                      Sense & Sensibility     4
##  5 "(1811)"                Sense & Sensibility     5
##  6 ""                      Sense & Sensibility     6
##  7 ""                      Sense & Sensibility     7
##  8 ""                      Sense & Sensibility     8
##  9 ""                      Sense & Sensibility     9
## 10 "CHAPTER 1"             Sense & Sensibility    10
## # ℹ 73,412 more rows

Make each word a token

tidy_books <-original_books %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 × 3
##    book                 line word       
##    <fct>               <int> <chr>      
##  1 Sense & Sensibility     1 sense      
##  2 Sense & Sensibility     1 and        
##  3 Sense & Sensibility     1 sensibility
##  4 Sense & Sensibility     3 by         
##  5 Sense & Sensibility     3 jane       
##  6 Sense & Sensibility     3 austen     
##  7 Sense & Sensibility     5 1811       
##  8 Sense & Sensibility    10 chapter    
##  9 Sense & Sensibility    10 1          
## 10 Sense & Sensibility    13 the        
## # ℹ 725,045 more rows

#Remove stop words using the anti_join function

match_stop_word <-tidy_books %>%
  anti_join(get_stopwords())
## Joining with `by = join_by(word)`
match_stop_word
## # A tibble: 325,084 × 3
##    book                 line word       
##    <fct>               <int> <chr>      
##  1 Sense & Sensibility     1 sense      
##  2 Sense & Sensibility     1 sensibility
##  3 Sense & Sensibility     3 jane       
##  4 Sense & Sensibility     3 austen     
##  5 Sense & Sensibility     5 1811       
##  6 Sense & Sensibility    10 chapter    
##  7 Sense & Sensibility    10 1          
##  8 Sense & Sensibility    13 family     
##  9 Sense & Sensibility    13 dashwood   
## 10 Sense & Sensibility    13 long       
## # ℹ 325,074 more rows

Calculate word frequencies, sorted from most to least frequent.

match_stop_word %>%
  count(word, sort= TRUE)
## # A tibble: 14,375 × 2
##    word      n
##    <chr> <int>
##  1 mr     3015
##  2 mrs    2446
##  3 must   2071
##  4 said   2041
##  5 much   1935
##  6 miss   1855
##  7 one    1831
##  8 well   1523
##  9 every  1456
## 10 think  1440
## # ℹ 14,365 more rows

Word clouds

match_stop_word %>%
  count(word, sort=TRUE) %>%
  head(100) %>%
  wordcloud2(size = 0.4, shape = 'triangle-forward',
             color = c("steelblue","firebrick","darkorchid"),
             backgroundColor = "green")

Non-interactive word cloud

match_stop_word %>%
  count(word)%>%
  with(wordcloud::wordcloud(word,n,max.words = 100))

Positive words in the Bing lexicon

positive <- get_sentiments("bing") %>%
  filter(sentiment == "positive")
positive
## # A tibble: 2,005 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abound      positive 
##  2 abounds     positive 
##  3 abundance   positive 
##  4 abundant    positive 
##  5 accessable  positive 
##  6 accessible  positive 
##  7 acclaim     positive 
##  8 acclaimed   positive 
##  9 acclamation positive 
## 10 accolade    positive 
## # ℹ 1,995 more rows

#Count positive words in Emma

tidy_books %>%
  filter(book == "Emma") %>%
  semi_join(positive)%>%
  count(word, sort=TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 668 × 2
##    word         n
##    <chr>    <int>
##  1 well       401
##  2 good       359
##  3 great      264
##  4 like       200
##  5 better     173
##  6 enough     129
##  7 happy      125
##  8 love       117
##  9 pleasure   115
## 10 right       92
## # ℹ 658 more rows
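
The count above uses semi_join(), while later steps use inner_join() and anti_join(). A small sketch of how the three joins differ is given here, using toy tibbles with hypothetical names and values.

```r
library(dplyr)
library(tibble)

# Toy tokens and a toy positive-word list (illustrative values only)
tokens <- tibble(line = c(1, 1, 2, 2),
                 word = c("happy", "the", "love", "sad"))
positive_toy <- tibble(word = c("happy", "love"), sentiment = "positive")

# semi_join: keep matching rows, columns of tokens only
kept_semi <- semi_join(tokens, positive_toy, by = "word")

# inner_join: keep matching rows and add the lexicon columns
kept_inner <- inner_join(tokens, positive_toy, by = "word")

# anti_join: keep the rows that do NOT match (as used for stop words)
removed <- anti_join(tokens, positive_toy, by = "word")

kept_semi
kept_inner
removed
```

semi_join() is enough when only the words are counted. inner_join() is needed when the sentiment label or score is used afterwards.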

Count the positive and negative words in each 80-line section of text.

bing <- get_sentiments("bing")
janeaustensentiment <- tidy_books %>%
  inner_join(bing) %>%
  count(book, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
janeaustensentiment
## # A tibble: 920 × 5
##    book                index negative positive sentiment
##    <fct>               <dbl>    <dbl>    <dbl>     <dbl>
##  1 Sense & Sensibility     0       16       32        16
##  2 Sense & Sensibility     1       19       53        34
##  3 Sense & Sensibility     2       12       31        19
##  4 Sense & Sensibility     3       15       31        16
##  5 Sense & Sensibility     4       16       34        18
##  6 Sense & Sensibility     5       16       51        35
##  7 Sense & Sensibility     6       24       40        16
##  8 Sense & Sensibility     7       23       51        28
##  9 Sense & Sensibility     8       30       40        10
## 10 Sense & Sensibility     9       15       19         4
## # ℹ 910 more rows
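
The index column groups lines into sections by integer division, and the net sentiment is the positive count minus the negative count in each section. The sketch below repeats that calculation on toy data with a section size of 3 instead of 80. It uses pivot_wider(), the newer tidyr equivalent of spread(), and all names and values are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy tokens that have already been matched to a sentiment label
scored <- tibble(
  line = c(1, 2, 3, 4, 5, 6),
  word = c("happy", "sad", "love", "good", "bad", "awful"),
  sentiment = c("positive", "negative", "positive",
                "positive", "negative", "negative")
)

section_size <- 3  # the analysis above uses 80

net_sentiment <- scored %>%
  # lines 1-2 fall in section 0, lines 3-5 in section 1, line 6 in section 2
  count(index = line %/% section_size, sentiment) %>%
  # one column per label, 0 where a section has no words with that label
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

net_sentiment
```

Because line numbers start at 1, the first section (index 0) holds one line fewer than the others.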

Visualize

Sections with non-negative net sentiment are drawn in cadet blue and sections with negative net sentiment in firebrick red.

janeaustensentiment %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "cadetblue") +
  geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
  geom_hline(yintercept = 0, color = "goldenrod") +
  facet_wrap(~book, ncol = 2, scales = "free_x")

#Most common positive and negative words

bing_word_counts <-tidy_books %>%
  inner_join(bing)%>%
  count(word,sentiment, sort=TRUE)
## Joining with `by = join_by(word)`
## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ℹ 2,575 more rows
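
The many-to-many warning in the join above means that at least one word appears more than once in the lexicon, so a token can match several lexicon rows. A hedged sketch of how to find such words and how to tell inner_join() that the duplication is expected is shown here. The toy lexicon and the word "envious" are illustrative assumptions, and the relationship argument requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Toy lexicon in which one word carries both labels (illustrative only)
lexicon_toy <- tibble(
  word = c("good", "envious", "envious"),
  sentiment = c("positive", "positive", "negative")
)
tokens <- tibble(word = c("good", "envious", "envious", "the"))

# Words listed more than once in the lexicon
ambiguous <- lexicon_toy %>%
  count(word) %>%
  filter(n > 1)

# Declaring the relationship silences the warning; each "envious" token
# still matches two lexicon rows, so it is counted under both labels
joined <- inner_join(tokens, lexicon_toy, by = "word",
                     relationship = "many-to-many")

ambiguous
joined
```

Whether to keep both labels or to drop the ambiguous words before joining is a judgment call that depends on the analysis.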

Sentiment Analysis with inner join

nrcjoy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrcjoy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows

Plot the sentiment scores against the index on the x-axis, which tracks narrative time in sections of text

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Now Comparing the Three Sentiment Dictionaries

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 3
##    book               line word     
##    <fct>             <int> <chr>    
##  1 Pride & Prejudice     1 pride    
##  2 Pride & Prejudice     1 and      
##  3 Pride & Prejudice     1 prejudice
##  4 Pride & Prejudice     3 by       
##  5 Pride & Prejudice     3 jane     
##  6 Pride & Prejudice     3 austen   
##  7 Pride & Prejudice     7 chapter  
##  8 Pride & Prejudice     7 1        
##  9 Pride & Prejudice    10 it       
## 10 Pride & Prejudice    10 is       
## # ℹ 122,194 more rows

#Words in Emma that match the AFINN lexicon

emma_afin <- tidy_books %>%
  filter(book=="Emma")%>%
  anti_join(get_stopwords())%>%
  inner_join(get_sentiments("afinn"))
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
emma_afin
## # A tibble: 10,159 × 4
##    book   line word         value
##    <fct> <int> <chr>        <dbl>
##  1 Emma     15 clever           2
##  2 Emma     15 rich             2
##  3 Emma     15 comfortable      2
##  4 Emma     16 happy            3
##  5 Emma     16 best             3
##  6 Emma     18 distress        -2
##  7 Emma     20 affectionate     3
##  8 Emma     22 died            -3
##  9 Emma     24 excellent        3
## 10 Emma     25 fallen          -2
## # ℹ 10,149 more rows

Count the matched words

emma_afin %>%
  count(word,sort=TRUE)
## # A tibble: 894 × 2
##    word       n
##    <chr>  <int>
##  1 miss     599
##  2 good     359
##  3 great    264
##  4 dear     241
##  5 like     200
##  6 better   173
##  7 hope     143
##  8 poor     136
##  9 wish     135
## 10 happy    125
## # ℹ 884 more rows

#Split the matched words into sections of 80 and sum the AFINN values in each section

emma_afinn_sentiment <-emma_afin %>%
  mutate(word_count=1:n(),
         index=word_count %/%80)%>%
  group_by(index)%>%
  summarize(sentiment=sum(value))
emma_afinn_sentiment
## # A tibble: 127 × 2
##    index sentiment
##    <dbl>     <dbl>
##  1     0        40
##  2     1        33
##  3     2        77
##  4     3        84
##  5     4        52
##  6     5        80
##  7     6        98
##  8     7        80
##  9     8        69
## 10     9        68
## # ℹ 117 more rows
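
Note that this index counts AFINN-matched words rather than lines of text, so it is not the same section index as line %/% 80 used with Bing. In addition, because the word count starts at 1, the first section holds one word fewer than the rest. The sketch below shows both indexing variants on toy scores with a section size of 3. The zero-based variant is a suggested alternative, not the method used above, and all names are hypothetical.

```r
library(dplyr)
library(tibble)

# Seven toy AFINN-style scores (illustrative values)
scores <- tibble(value = c(2, 2, 3, -2, 3, -3, 1))
section_size <- 3  # the analysis above uses 80

sections <- scores %>%
  mutate(
    index_as_written = (1:n()) %/% section_size,               # 0 0 1 1 1 2 2
    index_zero_based = (row_number() - 1) %/% section_size     # 0 0 0 1 1 1 2
  )

# Section sizes under each scheme
sizes_as_written <- count(sections, index_as_written)
sizes_zero_based <- count(sections, index_zero_based)

sizes_as_written
sizes_zero_based
```

With 80-word sections the difference is a single word in the first section, so it barely changes the sums, but the zero-based form gives equal-sized sections.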

#Word clouds of the words in section 104

emma_afin %>%
  mutate(word_count=1:n(),
         index=word_count %/%80)%>%
  filter(index==104)%>%
  count(word, sort=TRUE)%>%
  with(wordcloud::wordcloud(word,n,rot.per=.3))

emma_afin%>%
  mutate(word_count=1:n(),
         index=word_count %/%80)%>%
  filter(index==104)%>%
  count(word, sort=TRUE)%>%
  wordcloud2(size= 0.4,shape='diamond',
             backgroundColor = "darkseagreen")

#Visualize the AFINN sentiment of each section

emma_afinn_sentiment %>%
  ggplot(aes(index, sentiment)) +
  geom_col(aes(fill = cut_interval(sentiment, n = 5))) +
  geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
  scale_fill_brewer(palette = "RdBu", guide = "none") +
  theme(panel.background = element_rect(fill = "grey"),
        plot.background = element_rect(fill = "grey"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(title = "Afinn sentiment analysis of Emma")

#Boxplot of the AFINN values in sections 10, 104, and 105

emma_afin %>%
  mutate(word_count=1:n(),
         index=as.character(word_count %/%80))%>%
  filter(index==10 |index==104 |index==105)%>%
  ggplot(aes(value,index))+
  geom_boxplot()+
  geom_jitter()+
  coord_flip()+
  labs(y="section",x="Afinn")

Use inner_join() to calculate the sentiment of Pride & Prejudice with each lexicon

afinn <- get_sentiments("afinn")
afinn <- pride_prejudice %>%
  inner_join(afinn) %>%
  group_by(index = line %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(bing) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = line %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Make a plot to visualize net sentiment of positive - negative.

bind_rows(afinn,
 bing_and_nrc) %>%
 ggplot(aes(index, sentiment, fill = method)) +
 geom_col(show.legend = FALSE) +
 facet_wrap(~method, ncol = 1, scales = "free_y")

Count the positive and negative words in the NRC lexicon.

get_sentiments("nrc") %>%
 filter(sentiment %in% c("positive",
 "negative")) %>%
 count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308

#Count the positive and negative words in the Bing lexicon.

get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
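
The two counts above can be turned into a ratio to compare how strongly each lexicon leans negative. The short calculation below uses only the numbers printed above. The data frame and column names are hypothetical.

```r
# Counts copied from the two tables above
lexicon_counts <- data.frame(
  lexicon  = c("NRC", "Bing"),
  negative = c(3316, 4781),
  positive = c(2308, 2005)
)

# Negative words per positive word
lexicon_counts$neg_per_pos <- lexicon_counts$negative / lexicon_counts$positive

# Share of the positive/negative entries that are negative
lexicon_counts$share_negative <- lexicon_counts$negative /
  (lexicon_counts$negative + lexicon_counts$positive)

lexicon_counts
```

NRC has about 1.44 negative words per positive word and Bing about 2.38, so Bing contains proportionally more negative words. This is one possible reason why the two lexicons can give different net sentiment levels for the same text.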

Most Common Positive and Negative Words

bing_word_counts <- tidy_books %>%
 inner_join(get_sentiments("bing")) %>%
 count(word, sentiment, sort = TRUE) %>%
 ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
 bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ℹ 2,575 more rows

Plot the words that contribute most to positive and negative sentiment in Jane Austen’s novels

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

#custom stop words

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
 custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ℹ 1,140 more rows
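
The word "miss" is labeled negative by Bing, but in Austen's novels it is mostly a title. A custom stop-word list only changes the results once it is applied with anti_join() before the lexicon join. The sketch below shows that effect on toy data. The tibbles are made up and the names are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy tokens, toy lexicon, and a one-word custom stop list
tokens <- tibble(word = c("miss", "woodhouse", "happy", "miss", "sad"))
lexicon_toy <- tibble(word = c("miss", "happy", "sad"),
                      sentiment = c("negative", "positive", "negative"))
custom_stop_toy <- tibble(word = "miss", lexicon = "custom")

# Without the custom stop word, "miss" inflates the negative count
with_miss <- tokens %>%
  inner_join(lexicon_toy, by = "word") %>%
  count(sentiment)

# Removing it first leaves only the genuinely negative word
without_miss <- tokens %>%
  anti_join(custom_stop_toy, by = "word") %>%
  inner_join(lexicon_toy, by = "word") %>%
  count(sentiment)

with_miss
without_miss
```

In the analysis above, the same pattern would be tidy_books %>% anti_join(custom_stop_words, by = "word") followed by the join with the Bing lexicon.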

The most common words in Jane Austen’s novels

 tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

#Most common positive and negative words in Jane Austen’s novels

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Looking at Units Beyond Just Words

PandP_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

 PandP_sentences$sentence[2]
## [1] "by jane austen"
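
Word-level lexicons score each word on its own, so a negation such as "not ... good" is still counted as positive. Tokenizing by sentence keeps such words together. The sketch below tokenizes one made-up text both ways. The object names are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidytext)

# One toy document with two sentences (made up for illustration)
toy_text <- tibble(text = "I am not having a good day. The weather is lovely!")

# One row per sentence
toy_sentences <- toy_text %>%
  unnest_tokens(sentence, text, token = "sentences")

# One row per word (the default token)
toy_words <- toy_text %>%
  unnest_tokens(word, text)

toy_sentences
toy_words
```

At the word level "good" stands alone and loses its "not". At the sentence level the two stay in the same row, which is the starting point for sentiment methods that look beyond single words.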

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book) %>%
  summarize(words = n())
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book")) %>%
  mutate(ratio = negativewords/words) %>%
  top_n(1) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Selecting by ratio
## # A tibble: 1 × 4
##   book             negativewords words  ratio
##   <fct>                    <int> <int>  <dbl>
## 1 Northanger Abbey          2518 77780 0.0324

Analysis:

Now we extend the code example from Chapter 2 of Text Mining with R to a different corpus: the works of Charles Dickens from Project Gutenberg.

devtools::install_github("ropensci/gutenbergr")
 library(gutenbergr)
dickens_books <- gutenberg_works(author == 'Dickens, Charles')

dickens_books
## # A tibble: 54 × 8
##    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
##           <int> <chr>    <chr>                <int> <chr>    <chr>              
##  1           46 A Chris… Dicke…                  37 en       "Children's Litera…
##  2          564 The Mys… Dicke…                  37 en       "Mystery Fiction"  
##  3          580 The Pic… Dicke…                  37 en       "Best Books Ever L…
##  4          699 A Child… Dicke…                  37 en       "Children's Histor…
##  5          700 The Old… Dicke…                  37 en       ""                 
##  6          730 Oliver … Dicke…                  37 en       ""                 
##  7          766 David C… Dicke…                  37 en       "Harvard Classics" 
##  8          821 Dombey … Dicke…                  37 en       ""                 
##  9          917 Barnaby… Dicke…                  37 en       "Historical Fictio…
## 10          963 Little … Dicke…                  37 en       ""                 
## # ℹ 44 more rows
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
head(dickens_books)
## # A tibble: 6 × 8
##   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>     <chr>                <int> <chr>    <chr>              
## 1           46 A Christ… Dicke…                  37 en       "Children's Litera…
## 2          564 The Myst… Dicke…                  37 en       "Mystery Fiction"  
## 3          580 The Pick… Dicke…                  37 en       "Best Books Ever L…
## 4          699 A Child'… Dicke…                  37 en       "Children's Histor…
## 5          700 The Old … Dicke…                  37 en       ""                 
## 6          730 Oliver T… Dicke…                  37 en       ""                 
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
glimpse(dickens_books)
## Rows: 54
## Columns: 8
## $ gutenberg_id        <int> 46, 564, 580, 699, 700, 730, 766, 821, 917, 963, 9…
## $ title               <chr> "A Christmas Carol in Prose; Being a Ghost Story o…
## $ author              <chr> "Dickens, Charles", "Dickens, Charles", "Dickens, …
## $ gutenberg_author_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ language            <chr> "en", "en", "en", "en", "en", "en", "en", "en", "e…
## $ gutenberg_bookshelf <chr> "Children's Literature/Christmas", "Mystery Fictio…
## $ rights              <chr> "Public domain in the USA.", "Public domain in the…
## $ has_text            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…

#Download the books and tidy the text

tidydata <- dickens_books %>%
  gutenberg_download(meta_fields = 'title') %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
## Warning: ! Could not download a book at http://aleph.gutenberg.org/4/46/46.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## (The same warning was repeated for 16 more books, Gutenberg IDs 730, 766, 963, 1023, 1423, 30127, 40723, 40729, 41739, 41894, 42232, 43111, 43207, 46675, 47534, and 47535.)
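
Because some downloads failed, the tidy data covers fewer books than the metadata lists. One way to see which books are missing is to compare the requested IDs with the IDs that actually came back. The sketch below does this on toy tibbles. The titles, text, and object names are placeholders, not real download results.

```r
library(dplyr)
library(tibble)

# Toy metadata: the books that were requested
requested <- tibble(gutenberg_id = c(46L, 564L, 580L),
                    title = c("Book A", "Book B", "Book C"))

# Toy download result: one row per line of text, book 46 is absent
downloaded <- tibble(gutenberg_id = c(564L, 564L, 580L),
                     text = c("line one", "line two", "line one"))

# Requested books with no downloaded text
missing_books <- requested %>%
  anti_join(distinct(downloaded, gutenberg_id), by = "gutenberg_id")

missing_books
```

In this analysis the same comparison could be made between dickens_books and tidydata. As the warning suggests, gutenberg_download() also accepts a mirror argument for retrying with a different mirror.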

#Identify the line number

line_number <- dickens_books%>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()
line_number
## # A tibble: 54 × 9
##    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
##           <int> <chr>    <chr>                <int> <chr>    <chr>              
##  1           46 A Chris… Dicke…                  37 en       "Children's Litera…
##  2          564 The Mys… Dicke…                  37 en       "Mystery Fiction"  
##  3          580 The Pic… Dicke…                  37 en       "Best Books Ever L…
##  4          699 A Child… Dicke…                  37 en       "Children's Histor…
##  5          700 The Old… Dicke…                  37 en       ""                 
##  6          730 Oliver … Dicke…                  37 en       ""                 
##  7          766 David C… Dicke…                  37 en       "Harvard Classics" 
##  8          821 Dombey … Dicke…                  37 en       ""                 
##  9          917 Barnaby… Dicke…                  37 en       "Historical Fictio…
## 10          963 Little … Dicke…                  37 en       ""                 
## # ℹ 44 more rows
## # ℹ 3 more variables: rights <chr>, has_text <lgl>, line <int>

Sentiment text Analysis

Create a character vector

text <- c("Sentiment text analysis for chapter two -",
          "Start by getting the primary example code -",
          "should provide a citation to this base code -",
          "Incorporate at least one additional sentiment lexicon-",
          "You make work on a small team on this assignment"
          )
text
## [1] "Sentiment text analysis for chapter two -"             
## [2] "Start by getting the primary example code -"           
## [3] "should provide a citation to this base code -"         
## [4] "Incorporate at least one additional sentiment lexicon-"
## [5] "You make work on a small team on this assignment"

#Put the text into a tidy table

text_df <- tibble(line =1:5, text=text)
text_df
## # A tibble: 5 × 2
##    line text                                                  
##   <int> <chr>                                                 
## 1     1 Sentiment text analysis for chapter two -             
## 2     2 Start by getting the primary example code -           
## 3     3 should provide a citation to this base code -         
## 4     4 Incorporate at least one additional sentiment lexicon-
## 5     5 You make work on a small team on this assignment

Tokenization

text_df %>%
  unnest_tokens(word, text)
## # A tibble: 38 × 2
##     line word     
##    <int> <chr>    
##  1     1 sentiment
##  2     1 text     
##  3     1 analysis 
##  4     1 for      
##  5     1 chapter  
##  6     1 two      
##  7     2 start    
##  8     2 by       
##  9     2 getting  
## 10     2 the      
## # ℹ 28 more rows

Source:

#citation:

Silge, J., & Robinson, D. (n.d.). Welcome to Text Mining with R. In Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O’Reilly Media, 2017.