Sentiment Analysis of My Little Pony naratives

Introduction

    The following code processes transcripts from the cartoon show My Little Pony: Friendship is Magic. The text data is availible as .csv files on kaggle.com. There are several .csv files in the data set bundle. However, this demo will only be working with the ‘clean_data.csv’. This file contains line-by-line transcripts of the show with each corresponding episode title, writer and character (pony).
There are four features in the ‘clean_data.csv’:

Feature Description
1. title episode title
2. writer primary writer of the episode
3. pony character who delivered dialog line
4. dialogue text line of dialogue

    This demo will follow through the sentiment analysis example code given by chapter 2 in the text ‘Text Mining with R’. The original code is implemented in the section below ‘Sentiment Analysis with Tidy Data’.
    Begin by loading the necessary R libraries and the text data:


    The ‘clean_dialog.csv’ has been uploaded to the author’s github account to facilitate loading into the RStudio environment as an R data.frame:

## [1] 36859     4
## [1] "title"  "writer" "pony"   "dialog"


    There are many general purpose text lexicons available, three that will be used here: ‘afinn’, ‘bing’ and ‘nrc’. Here is a brief look at them:




Sentiment Analysis with Inner Join

    This section will explore sentiment as a function of narative time for a subset of My Little Pony episodes. How does sentiment vary over the course of episode dialogue?

    First, wrangle the text data to a ‘tidy’ format:

## # A tibble: 6 x 3
## # Groups:   title [197]
##   title              id word 
##   <fct>           <int> <chr>
## 1 28 Pranks Later     1 i'm  
## 2 28 Pranks Later     1 so   
## 3 28 Pranks Later     1 sorry
## 4 28 Pranks Later     1 i    
## 5 28 Pranks Later     1 lost 
## 6 28 Pranks Later     1 track

    The data is now organized such that each word, or token of text data is given a row with corresponding features for the line identity (id) and the episode title (title). Now perform an inner join with the ‘bing’ lexicon.

## # A tibble: 6 x 5
## # Groups:   title [197]
##   title           index negative positive sentiment
##   <fct>           <dbl>    <dbl>    <dbl>     <dbl>
## 1 28 Pranks Later     0        9        2        -7
## 2 28 Pranks Later     1        7        8         1
## 3 28 Pranks Later     2        4        7         3
## 4 28 Pranks Later     3        3        2        -1
## 5 28 Pranks Later     4        3        1        -2
## 6 28 Pranks Later     5        0        1         1

    Visualize sentiment over the course of episode narative:

    This visualization shows how the narative sentiment changes over the trajectory of each episode.


Comparing Sentiment Lexicons

    There are numerous sentiment lexicons. This compares the performance of three general purpose tidytext lexicons on the same sample text data:

#combining 2 episodes to have a longer narative to analyze:
doubleEpisode <- c('A Canterlot Wedding - Part 1', 'A Canterlot Wedding - Part 2')
#a dataframe for the first episode:
doubleEpLines1 <- episodeLines %>%
  filter( title %in% doubleEpisode[1] )
addlines <- max(doubleEpLines1$id)
#a dataframe for the second that increments the line 'id'
doubleEpLines2 <- episodeLines %>%
  filter( title %in% doubleEpisode[2] ) %>%
  mutate( id = id + addlines )
#bind the two episodes to one dataframe
doubleEpLines <- rbind( doubleEpLines1, doubleEpLines2)

#inner join with 'afinn' and sum sentiment value for every 10 lines
afinn <- doubleEpLines %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = id %/% 10) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
#inner join with 'bing' and 'nrc' also sum sentiment value for every 10 lines
bing_and_nrc <- bind_rows(doubleEpLines  %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          doubleEpLines  %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = id %/% 10, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

#bind afinn, bing, and nrc data then visualize the narative sentiments
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment,color='black', fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual( values = mycolors[2:4]) +
  labs( title = 'Sentiment Analysis', subtitle="Sentiment for the same text with 3 different lexicons")

    The general profile of the sentiment data follows the same envelope for the three lexicons (e.g. there is a peak at index 10 for all three data sets). However, there are noticable variations. For instance, the ‘nrc’ lexicon tends to label text data as more positive than ‘afinn’ and ‘bing’. It is important to know that these subtle differences exist and to note in methods sections which lexicon was used so that analysis and results can be reproduceable.

    This code demonstrates the differences in lexicon sentiment criterion:

## The +/- ratio for nrc= 1.437716 this < the +/- bing = 2.384539

    The above output demonstrates that overall, the bing lexicon has relatively more negative labels than nrc.


Most Common Positive & Negative Words

    Performing an inner join to label sentiment with tokens is beneficial. Here the manipulation is used to explore the most frequent positive and negative words that appear in My Little Pony Episodes.

## # A tibble: 6 x 3
##   word      n sentiment
##   <chr> <int> <chr>    
## 1 like   2021 positive 
## 2 well   1717 positive 
## 3 right  1492 positive 
## 4 good    971 positive 
## 5 sorry   650 negative 
## 6 best    592 positive

    Visualize the top 15 most frequent positive and negative words.

    From the figure above, it is appearant that the most common positive word sentiments make a much larger contribution than the most common negative words….as one would hope for a children’s (or for the young at heart) cartoon show.

    There is a problem! In the top 10 negative words, ‘discord’ is listed. This is problematic, because Discord is the name of a villain in My Little Pony. We would like to add this nameto a list of stop words. Stop words are words that are to be excluded from further analysis.

## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 discord     custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows

The Most Positive Pony

    Which main character in My Little Pony has the most positive sentiment? Here the text data is organized such that lines are grouped by character for the 50 most frequent characters in the cartoon. Additionally, the sentiment is analyzed with a different r library and a different sentiment lexicon: the Syuzhet library is used.

    Start by organizing the data:

##  [1] Twilight Sparkle       Rainbow Dash           Pinkie Pie            
##  [4] Applejack              Rarity                 Spike                 
##  [7] Fluttershy             Apple Bloom            Starlight Glimmer     
## [10] Others                 Sweetie Belle          Scootaloo             
## [13] Discord                Trixie                 Princess Celestia     
## [16] Granny Smith           Big McIntosh           Princess Cadance      
## [19] Maud Pie               Princess Luna          Sunburst              
## [22] Shining Armor          Thorax                 Diamond Tiara         
## [25] Flim                   Smolder                Spitfire              
## [28] Flam                   Zecora                 Cheerilee             
## [31] Gallus                 Cutie Mark Crusaders   Sandbar               
## [34] Ocellus                Cozy Glow              Silverstream          
## [37] Yona                   Prince Rutherford      Garble                
## [40] Gilda                  Zephyr Breeze          Cranky Doodle Donkey  
## [43] Rockhoof               Quibble Pants          Young Applejack       
## [46] Daring Do              Lightning Dust         Queen Chrysalis       
## [49] Star Swirl the Bearded Mayor Mare            
## 842 Levels: A. K. Yearling A.K. Yearling Ace Point Ahuizotl Alice ... Zombie ponies
## # A tibble: 50 x 2
##    pony                 count
##    <fct>                <int>
##  1 Apple Bloom           1362
##  2 Applejack             2748
##  3 Big McIntosh           277
##  4 Cheerilee              117
##  5 Cozy Glow               98
##  6 Cranky Doodle Donkey    74
##  7 Cutie Mark Crusaders   103
##  8 Daring Do               67
##  9 Diamond Tiara          150
## 10 Discord                587
## # … with 40 more rows

    Use the get_sentiment() function from the syuzhet library to build sentiment scores for 3 lexicons: ‘syuzhet’, ‘bing’, and ‘nrc’

Tally the token/word sentiment scores with respect the character (pony)

##                    pony        count          syuzhetScore    
##  Apple Bloom         : 1   Min.   :  60.00   Min.   :0.04766  
##  Applejack           : 1   1st Qu.:  91.25   1st Qu.:0.25234  
##  Big McIntosh        : 1   Median : 146.00   Median :0.34897  
##  Cheerilee           : 1   Mean   : 628.88   Mean   :0.37992  
##  Cozy Glow           : 1   3rd Qu.: 537.00   3rd Qu.:0.46687  
##  Cranky Doodle Donkey: 1   Max.   :4745.00   Max.   :1.05417  
##  (Other)             :44                                      
##    bingScore          nrcScore       
##  Min.   :-0.2266   Min.   :-0.09859  
##  1st Qu.: 0.2389   1st Qu.: 0.16586  
##  Median : 0.3031   Median : 0.29200  
##  Mean   : 0.3197   Mean   : 0.32874  
##  3rd Qu.: 0.3654   3rd Qu.: 0.44584  
##  Max.   : 1.0667   Max.   : 0.90598  
## 

    From the summary statistics and box plot figure above, we see that median Syuzhet score is higher than the other lexicons, the Bing lexicon in the least variable and the NRC lexicon is the most variable. However, the distributions are heavily overlapping.
    Now to rank the line normalized syuzhet sentiment scores by character to find the most positive ponies:

## Warning in brewer.pal(50, "PuRd"): n too large, allowed maximum for palette PuRd is 9
## Returning the palette you asked for with that many colors

    The figure above plots the top 15 syuzhet sentiment scoring characters with scores normalized by the number of lines delivered. The number of lines delivered was used to normalize the score as a way to qualify how positive each contribution to the narative was. The result is very interesting. Mayer Mare ranks as the most positive character by this scoring system and this makes sense as she plays a maternal role on the show. However, this ranking also rate a villain, Discord, very highly. Perhaps we will have to probe deeper than just single word token sentiments to develop a more sophisticated ranking of positive roles for characters on the show.

Sentiment Analysis with Tidy Data

Introduction

    The following code executes the primary example code for chapter 2 of Text Mining with R. Please see the text for details and a full walk through of the code.

    This example uses the ‘janeaustenr’ library to explore sentiment analysis with the works of Jane Austen

The sentiments dataset

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows

Sentiment analysis with inner join

## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # … with 293 more rows

### Comparing the Three Sentiment Dictionaries

## # A tibble: 6 x 4
##   book              linenumber chapter word     
##   <fct>                  <int>   <int> <chr>    
## 1 Pride & Prejudice          1       0 pride    
## 2 Pride & Prejudice          1       0 and      
## 3 Pride & Prejudice          1       0 prejudice
## 4 Pride & Prejudice          3       0 by       
## 5 Pride & Prejudice          3       0 jane     
## 6 Pride & Prejudice          3       0 austen

The ratio ofnegative to positive words is higher in the Bing lexicon than the NRC lexicon; this accounts for the differences in the plots above.

## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Most Common Positive and Negative Words

## # A tibble: 6 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639

Custom stop words:

## # A tibble: 6 x 2
##   word  lexicon
##   <chr> <chr>  
## 1 miss  custom 
## 2 a     SMART  
## 3 a's   SMART  
## 4 able  SMART  
## 5 about SMART  
## 6 above SMART

Wordclouds

### Looking at units beyond just words

## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

what is the most negative chapter?

## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

References

  1. Text Mining with R
  2. My Little Pony kaggle.com dataset