The following code processes transcripts from the cartoon show My Little Pony: Friendship is Magic. The text data is available as .csv files on Kaggle. There are several .csv files in the data set bundle; however, this demo works only with ‘clean_dialog.csv’. This file contains line-by-line transcripts of the show along with each line’s corresponding episode title, writer, and character (pony).
There are four features in ‘clean_dialog.csv’:
| # | Feature | Description |
|---|---|---|
| 1. | title | episode title |
| 2. | writer | primary writer of the episode |
| 3. | pony | character who delivered the line of dialogue |
| 4. | dialog | text of the dialogue line |
    This demo follows the sentiment analysis example code given in chapter 2 of the text ‘Text Mining with R’. The original code is implemented below in the section ‘Sentiment Analysis with Tidy Data’.
    Begin by loading the necessary R libraries and the text data:
library( dplyr )
library( tidyverse )
library( tidytext )
library( textdata )
library( ggplot2 )
library( RColorBrewer )
library( wordcloud )
library( reshape2 )
    The ‘clean_dialog.csv’ file has been uploaded to the author’s GitHub account to facilitate loading it into the RStudio environment as an R data.frame:
mlpURL <- "https://raw.githubusercontent.com/SmilodonCub/DATA607/master/my-little-pony-transcript/clean_dialog.csv"
mlp_df <- read.csv( mlpURL )
dim( mlp_df )
## [1] 36859 4
colnames( mlp_df )
## [1] "title" "writer" "pony" "dialog"
    There are many general-purpose text lexicons available; three will be used here: ‘afinn’, ‘bing’, and ‘nrc’. Here is a brief look at them:
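    Below is a minimal sketch of how the three lexicons can be inspected with tidytext’s get_sentiments() (their full previews are also printed in the ‘Sentiment Analysis with Tidy Data’ section at the end of this document):
#'afinn' assigns integer scores from -5 to 5; 'bing' and 'nrc' assign categorical labels
head( get_sentiments("afinn") )
head( get_sentiments("bing") )
head( get_sentiments("nrc") )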
    This section will explore sentiment as a function of narrative time for a subset of My Little Pony episodes. How does sentiment vary over the course of episode dialogue?
    First, wrangle the text data to a ‘tidy’ format:
episodeLines <- mlp_df %>%
group_by( title ) %>% #with respect to episode title:
mutate( id = row_number()) %>% #add a new feature 'id' to enumerate each row of text
group_by( title, id) %>% #with respect to episode title & line of text(id):
summarise( lines = paste(dialog, collapse = "&&")) %>%
#paste all episode lines together delimited by '&&'
mutate( lines = str_split( lines, "&&") ) %>%
#mutate lines to a list of lines
unnest( lines ) %>% #unnest list of lines to one line per row
unnest_tokens(word, lines) #one token/word per line
head(episodeLines)
## # A tibble: 6 x 3
## # Groups: title [197]
## title id word
## <fct> <int> <chr>
## 1 28 Pranks Later 1 i'm
## 2 28 Pranks Later 1 so
## 3 28 Pranks Later 1 sorry
## 4 28 Pranks Later 1 i
## 5 28 Pranks Later 1 lost
## 6 28 Pranks Later 1 track
    The data is now organized such that each word (token) of text data occupies its own row, with corresponding features for the line identity (id) and the episode title (title). Now perform an inner join with the ‘bing’ lexicon.
#list of first 6 episode title
first6Episodes <- unique(episodeLines$title)[1:6]
#perform an inner join with the bing lexicon
pony_sentiment <- episodeLines %>%
filter( title %in% first6Episodes ) %>% #subset for the first 6 episodes
inner_join(get_sentiments("bing")) %>% #inner join with 'bing' lexicon
#for each title, tally the sentiment score of the
#tokens in increments of 10 lines
count(title, index = id %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head( pony_sentiment )
## # A tibble: 6 x 5
## # Groups: title [197]
## title index negative positive sentiment
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 28 Pranks Later 0 9 2 -7
## 2 28 Pranks Later 1 7 8 1
## 3 28 Pranks Later 2 4 7 3
## 4 28 Pranks Later 3 3 2 -1
## 5 28 Pranks Later 4 3 1 -2
## 6 28 Pranks Later 5 0 1 1
    Visualize sentiment over the course of the episode narrative:
colourCount = length(unique(pony_sentiment$title))
mycolors = colorRampPalette(brewer.pal(50, "PuRd"))(colourCount)
ggplot(pony_sentiment, aes(index, sentiment,color='black', fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 2, scales = "free_x") +
scale_fill_manual( values = mycolors) +
labs( title = 'Sentiment Analysis', subtitle="Sentiment across episode trajectory for the first 6 episodes") Â Â Â Â This visualization shows how the narative sentiment changes over the trajectory of each episode.
    There are numerous sentiment lexicons. This section compares the performance of three general-purpose tidytext lexicons on the same sample of text data:
#combining 2 episodes to have a longer narrative to analyze:
doubleEpisode <- c('A Canterlot Wedding - Part 1', 'A Canterlot Wedding - Part 2')
#a dataframe for the first episode:
doubleEpLines1 <- episodeLines %>%
filter( title %in% doubleEpisode[1] )
addlines <- max(doubleEpLines1$id)
#a dataframe for the second that increments the line 'id'
doubleEpLines2 <- episodeLines %>%
filter( title %in% doubleEpisode[2] ) %>%
mutate( id = id + addlines )
#bind the two episodes to one dataframe
doubleEpLines <- rbind( doubleEpLines1, doubleEpLines2)
#inner join with 'afinn' and sum sentiment value for every 10 lines
afinn <- doubleEpLines %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = id %/% 10) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
#inner join with 'bing' and 'nrc' also sum sentiment value for every 10 lines
bing_and_nrc <- bind_rows(doubleEpLines %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
doubleEpLines %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = id %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
#bind afinn, bing, and nrc data, then visualize the narrative sentiments
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment,color='black', fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y") +
scale_fill_manual( values = mycolors[2:4]) +
labs( title = 'Sentiment Analysis', subtitle="Sentiment for the same text with 3 different lexicons")     The general profile of the sentiment data follows the same envelope for the three lexicons (e.g. there is a peak at index 10 for all three data sets). However, there are noticable variations. For instance, the ‘nrc’ lexicon tends to label text data as more positive than ‘afinn’ and ‘bing’. It is important to know that these subtle differences exist and to note in methods sections which lexicon was used so that analysis and results can be reproduceable.
    This code demonstrates the differences in the lexicons’ sentiment criteria:
nrc_sent <- get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
nrc_ratio <- nrc_sent$n[1]/nrc_sent$n[2]
bing_sent <- get_sentiments("bing") %>%
count(sentiment)
bing_ratio <- bing_sent$n[1]/bing_sent$n[2]
cat( 'The +/- ratio for nrc=', nrc_ratio, 'this < the +/- bing =',bing_ratio)
## The +/- ratio for nrc= 1.437716 this < the +/- bing = 2.384539
    The above output demonstrates that, overall, the bing lexicon has a higher ratio of negative to positive labels than the nrc lexicon.
    Performing an inner join to label tokens with sentiment is a useful manipulation. Here it is used to explore the most frequent positive and negative words that appear in My Little Pony episodes.
bing_word_counts <- episodeLines %>%
group_by( word ) %>% #group with respect to word,
summarise( n = n()) %>% #count the total occurrences of each word
inner_join(get_sentiments("bing")) %>% #join bing sentiments
arrange( desc( n )) #arrange in descending order
head(bing_word_counts)
## # A tibble: 6 x 3
## word n sentiment
## <chr> <int> <chr>
## 1 like 2021 positive
## 2 well 1717 positive
## 3 right 1492 positive
## 4 good 971 positive
## 5 sorry 650 negative
## 6 best 592 positive
    Visualize the top 15 most frequent positive and negative words.
bing_word_counts %>%
group_by(sentiment) %>%
top_n(15,n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n,color='black', fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
scale_fill_manual( values = mycolors[3:4]) +
labs( title = 'Most Common Words', subtitle="15 most frequent positive and negative words that appear across all transcripts")
    From the figure above, it is apparent that the most common positive words make a much larger contribution to sentiment than the most common negative words, as one would hope for a children’s (or young-at-heart) cartoon show.
    There is a problem! Among the most frequent negative words, ‘discord’ is listed. This is problematic because Discord is the name of a villain in My Little Pony. We would like to add this name to a list of stop words. Stop words are words that are excluded from further analysis.
#add 'discord' to the stop words...
custom_stop_words <- bind_rows(tibble(word = c("discord"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 discord custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
    Traditional graphs are great for getting a point across, but wordclouds are visually enjoyable and can be quite informative. This code uses functions from wordcloud & wordcloud2 to present the most common words in several different visualizations:
#basic word cloud
bing_word_counts %>%
anti_join(stop_words) %>%
with(wordcloud(word, n, max.words = 80, colors = c("#F592AB","#BF408B","#B040BF","#8340BF")))
#word cloud that compares categorical tokens (positive vs negative sentiments)
bing_word_counts %>%
inner_join(get_sentiments("bing")) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("cyan", "magenta"),
max.words = 100)
    A word cloud with a custom shape (using wordcloud2):
#wordcloud with a custom shape
library( wordcloud2 )
plotWords <- bing_word_counts %>%
anti_join(custom_stop_words) %>%
arrange( desc(n)) %>%
top_n( 100,n)
starCloud <- wordcloud2(plotWords, shape='star',size=0.3, color = 'magenta', backgroundColor="skyblue")
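    Note that wordcloud2() returns an HTML widget rather than drawing to the plot device; printing the stored object renders it:
#print the widget object to render the star-shaped word cloud
starCloud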
    Which main character in My Little Pony has the most positive sentiment? Here the text data is organized such that lines are grouped by character for the 50 most frequent characters in the cartoon. Additionally, the sentiment is analyzed with a different R library and different sentiment lexicons: the syuzhet library is used.
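    As a quick, hypothetical illustration of the syuzhet interface (the example sentence below is made up for this sketch), get_sentiment() takes a character vector and returns one numeric score per element for the chosen method:
library( syuzhet )
#each call returns a numeric sentiment score for the example sentence
get_sentiment( "Friendship is magic and everypony is happy!", method = "syuzhet" )
get_sentiment( "Friendship is magic and everypony is happy!", method = "bing" )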
    Start by organizing the data:
ponies_top50 <- mlp_df %>%
group_by( pony ) %>%
summarise( count = n(), lines=paste(dialog, collapse="&&") ) %>%
mutate(lines = str_split( lines, "&&")) %>%
arrange( desc( count ) ) %>%
top_n( 50, count ) %>%
unnest( lines ) %>%
unnest_tokens(word, lines)
unique(ponies_top50$pony)
## [1] Twilight Sparkle Rainbow Dash Pinkie Pie
## [4] Applejack Rarity Spike
## [7] Fluttershy Apple Bloom Starlight Glimmer
## [10] Others Sweetie Belle Scootaloo
## [13] Discord Trixie Princess Celestia
## [16] Granny Smith Big McIntosh Princess Cadance
## [19] Maud Pie Princess Luna Sunburst
## [22] Shining Armor Thorax Diamond Tiara
## [25] Flim Smolder Spitfire
## [28] Flam Zecora Cheerilee
## [31] Gallus Cutie Mark Crusaders Sandbar
## [34] Ocellus Cozy Glow Silverstream
## [37] Yona Prince Rutherford Garble
## [40] Gilda Zephyr Breeze Cranky Doodle Donkey
## [43] Rockhoof Quibble Pants Young Applejack
## [46] Daring Do Lightning Dust Queen Chrysalis
## [49] Star Swirl the Bearded Mayor Mare
## 842 Levels: A. K. Yearling A.K. Yearling Ace Point Ahuizotl Alice ... Zombie ponies
ponies_lineTally <- ponies_top50 %>%
select( pony, count ) %>%
group_by( pony ) %>%
summarise( count = max(count) )
ponies_lineTally
## # A tibble: 50 x 2
## pony count
## <fct> <int>
## 1 Apple Bloom 1362
## 2 Applejack 2748
## 3 Big McIntosh 277
## 4 Cheerilee 117
## 5 Cozy Glow 98
## 6 Cranky Doodle Donkey 74
## 7 Cutie Mark Crusaders 103
## 8 Daring Do 67
## 9 Diamond Tiara 150
## 10 Discord 587
## # … with 40 more rows
    Use the get_sentiment() function from the syuzhet library to build sentiment scores for 3 lexicons: ‘syuzhet’, ‘bing’, and ‘nrc’:
library( syuzhet )
ponies_top50$syuzhet <- get_sentiment(ponies_top50$word, method="syuzhet")
ponies_top50$bing <- get_sentiment(ponies_top50$word, method="bing")
ponies_top50$nrc <- get_sentiment(ponies_top50$word, method="nrc")
    Tally the token/word sentiment scores with respect to the character (pony):
#group by pony and summarise the sums of the 3 lexicon scores
ponies_sentimentScores <- ponies_top50 %>%
group_by( pony, count ) %>%
summarise( syuzhetScore = sum( syuzhet ),
bingScore = sum( bing ),
nrcScore = sum( nrc ))
#normalize the scores to account for the number of lines delivered by each character
ponies_sentimentScores <- ponies_sentimentScores %>%
mutate( syuzhetScore = syuzhetScore/count,
bingScore = bingScore/count,
nrcScore = nrcScore/count)
summary( ponies_sentimentScores )
## pony count syuzhetScore
## Apple Bloom : 1 Min. : 60.00 Min. :0.04766
## Applejack : 1 1st Qu.: 91.25 1st Qu.:0.25234
## Big McIntosh : 1 Median : 146.00 Median :0.34897
## Cheerilee : 1 Mean : 628.88 Mean :0.37992
## Cozy Glow : 1 3rd Qu.: 537.00 3rd Qu.:0.46687
## Cranky Doodle Donkey: 1 Max. :4745.00 Max. :1.05417
## (Other) :44
## bingScore nrcScore
## Min. :-0.2266 Min. :-0.09859
## 1st Qu.: 0.2389 1st Qu.: 0.16586
## Median : 0.3031 Median : 0.29200
## Mean : 0.3197 Mean : 0.32874
## 3rd Qu.: 0.3654 3rd Qu.: 0.44584
## Max. : 1.0667 Max. : 0.90598
##
#pivot the data longer to facilitate plotting the distributions of scores by lexicon
plotData <- ponies_sentimentScores %>%
pivot_longer(cols = syuzhetScore:nrcScore, names_to = 'lexicon')
#visualize as box plot:
ggplot(plotData, aes(x=lexicon, y=value)) +
geom_boxplot(color="#8340BF", fill="#BF408B", alpha=0.2,
outlier.colour="#B040BF", outlier.fill="#B040BF", outlier.size=5) +
labs( title = 'Lexicon Scores Compared:', subtitle="distribution of lexicon scores normalized by lines delivered for top 50 characters")
    From the summary statistics and box plot figure above, we see that the median Syuzhet score is higher than that of the other lexicons, the Bing lexicon is the least variable, and the NRC lexicon is the most variable. However, the distributions overlap heavily.
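    As a quick check on the variability claim, the spread of the normalized scores can be tabulated by lexicon (a minimal sketch using the long-format plotData built above):
#standard deviation and interquartile range of the normalized scores per lexicon
plotData %>%
  group_by( lexicon ) %>%
  summarise( score_sd = sd( value ), score_iqr = IQR( value ) )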
    Now rank the line-normalized syuzhet sentiment scores by character to find the most positive ponies:
## Warning in brewer.pal(50, "PuRd"): n too large, allowed maximum for palette PuRd is 9
## Returning the palette you asked for with that many colors
plotData <- ponies_sentimentScores %>%
arrange( desc( syuzhetScore )) %>%
head( n = 15L )
ordered <- plotData$pony
ggplot(plotData, aes(x=pony, y=syuzhetScore, color='black',fill=factor(pony) )) +
geom_bar( stat = 'identity' ) +
scale_x_discrete( limits = rev(ordered)) +
coord_flip() +
scale_fill_manual(values = mycolors ) +
theme(legend.position="none") +
labs( title = 'Highest Syuzhet Score', subtitle = 'Top 15 ranked character Syuzhet Scores normalized by #lines delivered')
    The figure above plots the top 15 syuzhet-scoring characters, with scores normalized by the number of lines delivered. The number of lines was used to normalize the score as a way to qualify how positive each character’s contribution to the narrative was. The result is very interesting. Mayor Mare ranks as the most positive character by this scoring system, which makes sense as she plays a maternal role on the show. However, this ranking also rates a villain, Discord, very highly. Perhaps we will have to probe deeper than single-word token sentiments to develop a more sophisticated ranking of positive roles for characters on the show.
    The following code executes the primary example code for chapter 2 of Text Mining with R. Please see the text for details and a full walk-through of the code.
    This example uses the ‘janeaustenr’ library to explore sentiment analysis with the works of Jane Austen.
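    The three lexicon previews printed below are produced by calling get_sentiments() for each lexicon, as in the book:
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")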
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
tidytext::unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # … with 293 more rows
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
### Comparing the Three Sentiment Dictionaries
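    The output below previews the pride_prejudice data frame which, as in the book, is the subset of tidy_books for that novel; the defining call is reproduced here:
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
head( pride_prejudice )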
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
    The ratio of negative to positive words is higher in the Bing lexicon than in the NRC lexicon; this accounts for the differences in the plots above.
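    The counts behind that comparison come from tallying the negative and positive entries in each lexicon; the calls, as in the book, are reproduced here:
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
get_sentiments("bing") %>%
  count(sentiment)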
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
head(bing_word_counts)
## # A tibble: 6 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
    Custom stop words:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
head(custom_stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
### Looking at units beyond just words
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
#summarise number of chapters in each book
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
    What is the most negative chapter?
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343