The following code processes transcripts from the cartoon show My Little Pony: Friendship is Magic. The text data is available as .csv files on Kaggle. There are several .csv files in the data set bundle; however, this demo works only with ‘clean_dialog.csv’. This file contains line-by-line transcripts of the show along with each line’s corresponding episode title, writer, and character (pony).
There are four features in ‘clean_dialog.csv’:
| # | Feature | Description |
|---|---|---|
| 1. | title | episode title |
| 2. | writer | primary writer of the episode |
| 3. | pony | character who delivered the line of dialogue |
| 4. | dialog | text of the dialogue line |
    This demo follows the sentiment analysis example code given in chapter 2 of the text ‘Text Mining with R’. The original code is implemented below in the section ‘Sentiment Analysis with Tidy Data’.
    Begin by loading the necessary R libraries and the text data:
library( dplyr )
library( tidyverse )
library( tidytext )
library( textdata )
library( ggplot2 )
library( RColorBrewer )
library( wordcloud )
library( reshape2 )
    The ‘clean_dialog.csv’ file has been uploaded to the author’s GitHub account to facilitate loading it into the RStudio environment as an R data.frame:
mlpURL <- "https://raw.githubusercontent.com/SmilodonCub/DATA607/master/my-little-pony-transcript/clean_dialog.csv"
mlp_df <- read.csv( mlpURL )
dim( mlp_df )
## [1] 36859 4
colnames( mlp_df )
## [1] "title" "writer" "pony" "dialog"
    There are many general-purpose text lexicons available; three will be used here: ‘afinn’, ‘bing’, and ‘nrc’. Here is a brief look at them:
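    Below is a minimal sketch of how the three lexicons can be inspected with tidytext’s get_sentiments() (their full previews are also printed in the ‘Sentiment Analysis with Tidy Data’ section at the end of this document):
#'afinn' assigns integer scores from -5 to 5; 'bing' and 'nrc' assign categorical labels
head( get_sentiments("afinn") )
head( get_sentiments("bing") )
head( get_sentiments("nrc") )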
    This section will explore sentiment as a function of narrative time for a subset of My Little Pony episodes. How does sentiment vary over the course of episode dialogue?
    First, wrangle the text data to a ‘tidy’ format:
episodeLines <- mlp_df %>%
group_by( title ) %>% #with respect to episode title:
mutate( id = row_number()) %>% #add a new feature 'id' to enumerate each row of text
group_by( title, id) %>% #with respect to episode title & line of text(id):
summarise( lines = paste(dialog, collapse = "&&")) %>%
#paste all episode lines together delimited by '&&'
mutate( lines = str_split( lines, "&&") ) %>%
#mutate lines to a list of lines
unnest( lines ) %>% #unnest list of lines to one line per row
unnest_tokens(word, lines) #one token/word per line
head(episodeLines)
## # A tibble: 6 x 3
## # Groups: title [197]
## title id word
## <fct> <int> <chr>
## 1 28 Pranks Later 1 i'm
## 2 28 Pranks Later 1 so
## 3 28 Pranks Later 1 sorry
## 4 28 Pranks Later 1 i
## 5 28 Pranks Later 1 lost
## 6 28 Pranks Later 1 track
    The data is now organized such that each word (token) of text data occupies its own row, with corresponding features for the line identity (id) and the episode title (title). Now perform an inner join with the ‘bing’ lexicon.
#list of first 6 episode title
first6Episodes <- unique(episodeLines$title)[1:6]
#perform an inner join with the bing lexicon
pony_sentiment <- episodeLines %>%
filter( title %in% first6Episodes ) %>% #subset for the first 6 episodes
inner_join(get_sentiments("bing")) %>% #inner join with 'bing' lexicon
#for each title, tally the sentiment score of the
#tokens in increments of 10 lines
count(title, index = id %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head( pony_sentiment )
## # A tibble: 6 x 5
## # Groups: title [197]
## title index negative positive sentiment
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 28 Pranks Later 0 9 2 -7
## 2 28 Pranks Later 1 7 8 1
## 3 28 Pranks Later 2 4 7 3
## 4 28 Pranks Later 3 3 2 -1
## 5 28 Pranks Later 4 3 1 -2
## 6 28 Pranks Later 5 0 1 1
    Visualize sentiment over the course of the episode narrative:
colourCount = length(unique(pony_sentiment$title))
mycolors = colorRampPalette(brewer.pal(50, "PuRd"))(colourCount)
ggplot(pony_sentiment, aes(index, sentiment,color='black', fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 2, scales = "free_x") +
scale_fill_manual( values = mycolors) +
labs( title = 'Sentiment Analysis', subtitle="Sentiment across episode trajectory for the first 6 episodes") Â Â Â Â This visualization shows how the narative sentiment changes over the trajectory of each episode.
    There are numerous sentiment lexicons. This section compares the performance of three general-purpose tidytext lexicons on the same sample of text data:
#combining 2 episodes to have a longer narrative to analyze:
doubleEpisode <- c('A Canterlot Wedding - Part 1', 'A Canterlot Wedding - Part 2')
#a dataframe for the first episode:
doubleEpLines1 <- episodeLines %>%
filter( title %in% doubleEpisode[1] )
addlines <- max(doubleEpLines1$id)
#a dataframe for the second that increments the line 'id'
doubleEpLines2 <- episodeLines %>%
filter( title %in% doubleEpisode[2] ) %>%
mutate( id = id + addlines )
#bind the two episodes to one dataframe
doubleEpLines <- rbind( doubleEpLines1, doubleEpLines2)
#inner join with 'afinn' and sum sentiment value for every 10 lines
afinn <- doubleEpLines %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = id %/% 10) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
#inner join with 'bing' and 'nrc' also sum sentiment value for every 10 lines
bing_and_nrc <- bind_rows(doubleEpLines %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
doubleEpLines %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = id %/% 10, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
#bind afinn, bing, and nrc data, then visualize the narrative sentiments
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment,color='black', fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y") +
scale_fill_manual( values = mycolors[2:4]) +
labs( title = 'Sentiment Analysis', subtitle="Sentiment for the same text with 3 different lexicons")     The general profile of the sentiment data follows the same envelope for the three lexicons (e.g. there is a peak at index 10 for all three data sets). However, there are noticable variations. For instance, the ‘nrc’ lexicon tends to label text data as more positive than ‘afinn’ and ‘bing’. It is important to know that these subtle differences exist and to note in methods sections which lexicon was used so that analysis and results can be reproduceable.
    This code demonstrates the differences in the lexicons’ sentiment criteria:
nrc_sent <- get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
nrc_ratio <- nrc_sent$n[1]/nrc_sent$n[2]
bing_sent <- get_sentiments("bing") %>%
count(sentiment)
bing_ratio <- bing_sent$n[1]/bing_sent$n[2]
cat( 'The +/- ratio for nrc=', nrc_ratio, 'this < the +/- bing =',bing_ratio)
## The +/- ratio for nrc= 1.437716 this < the +/- bing = 2.384539
    The above output demonstrates that, overall, the bing lexicon has a higher ratio of negative to positive labels than the nrc lexicon.
    Performing an inner join to label tokens with sentiment is a useful manipulation. Here it is used to explore the most frequent positive and negative words that appear in My Little Pony episodes.
bing_word_counts <- episodeLines %>%
group_by( word ) %>% #group with respect to word,
summarise( n = n()) %>% #count the total occurrences of each word
inner_join(get_sentiments("bing")) %>% #join bing sentiments
arrange( desc( n )) #arrange in descending order
head(bing_word_counts)
## # A tibble: 6 x 3
## word n sentiment
## <chr> <int> <chr>
## 1 like 2021 positive
## 2 well 1717 positive
## 3 right 1492 positive
## 4 good 971 positive
## 5 sorry 650 negative
## 6 best 592 positive
    Visualize the top 15 most frequent positive and negative words.
bing_word_counts %>%
group_by(sentiment) %>%
top_n(15,n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n,color='black', fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
scale_fill_manual( values = mycolors[3:4]) +
labs( title = 'Most Common Words', subtitle="15 most frequent positive and negative words that appear across all transcripts")
    From the figure above, it is apparent that the most common positive words make a much larger contribution to sentiment than the most common negative words, as one would hope for a children’s (or young-at-heart) cartoon show.
    There is a problem! Among the most frequent negative words, ‘discord’ is listed. This is problematic because Discord is the name of a villain in My Little Pony. We would like to add this name to a list of stop words. Stop words are words that are excluded from further analysis.
#add 'discord' to the stop words...
custom_stop_words <- bind_rows(tibble(word = c("discord"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 discord custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
    Traditional graphs are great for getting a point across, but wordclouds are visually enjoyable and can be quite informative. This code uses functions from wordcloud & wordcloud2 to present the most common words in several different visualizations:
#basic word cloud
bing_word_counts %>%
anti_join(stop_words) %>%
with(wordcloud(word, n, max.words = 80, colors = c("#F592AB","#BF408B","#B040BF","#8340BF")))
#word cloud that compares categorical tokens (positive vs negative sentiments)
bing_word_counts %>%
inner_join(get_sentiments("bing")) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("cyan", "magenta"),
max.words = 100)
    A word cloud with a custom shape (using wordcloud2):
#wordcloud with a custom shape
library( wordcloud2 )
plotWords <- bing_word_counts %>%
anti_join(custom_stop_words) %>%
arrange( desc(n)) %>%
top_n( 100,n)
starCloud <- wordcloud2(plotWords, shape='star',size=0.3, color = 'magenta', backgroundColor="skyblue")
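    Note that wordcloud2() returns an HTML widget rather than drawing to the plot device; printing the stored object renders it:
#print the widget object to render the star-shaped word cloud
starCloud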
    Which main character in My Little Pony has the most positive sentiment? Here the text data is organized such that lines are grouped by character for the 50 most frequent characters in the cartoon. Additionally, the sentiment is analyzed with a different R library and different sentiment lexicons: the syuzhet library is used.
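    As a quick, hypothetical illustration of the syuzhet interface (the example sentence below is made up for this sketch), get_sentiment() takes a character vector and returns one numeric score per element for the chosen method:
library( syuzhet )
#each call returns a numeric sentiment score for the example sentence
get_sentiment( "Friendship is magic and everypony is happy!", method = "syuzhet" )
get_sentiment( "Friendship is magic and everypony is happy!", method = "bing" )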
    Start by organizing the data:
ponies_top50 <- mlp_df %>%
group_by( pony ) %>%
summarise( count = n(), lines=paste(dialog, collapse="&&") ) %>%
mutate(lines = str_split( lines, "&&")) %>%
arrange( desc( count ) ) %>%
top_n( 50, count ) %>%
unnest( lines ) %>%
unnest_tokens(word, lines)
unique(ponies_top50$pony)
## [1] Twilight Sparkle Rainbow Dash Pinkie Pie
## [4] Applejack Rarity Spike
## [7] Fluttershy Apple Bloom Starlight Glimmer
## [10] Others Sweetie Belle Scootaloo
## [13] Discord Trixie Princess Celestia
## [16] Granny Smith Big McIntosh Princess Cadance
## [19] Maud Pie Princess Luna Sunburst
## [22] Shining Armor Thorax Diamond Tiara
## [25] Flim Smolder Spitfire
## [28] Flam Zecora Cheerilee
## [31] Gallus Cutie Mark Crusaders Sandbar
## [34] Ocellus Cozy Glow Silverstream
## [37] Yona Prince Rutherford Garble
## [40] Gilda Zephyr Breeze Cranky Doodle Donkey
## [43] Rockhoof Quibble Pants Young Applejack
## [46] Daring Do Lightning Dust Queen Chrysalis
## [49] Star Swirl the Bearded Mayor Mare
## 842 Levels: A. K. Yearling A.K. Yearling Ace Point Ahuizotl Alice ... Zombie ponies
ponies_lineTally <- ponies_top50 %>%
select( pony, count ) %>%
group_by( pony ) %>%
summarise( count = max(count) )
ponies_lineTally
## # A tibble: 50 x 2
## pony count
## <fct> <int>
## 1 Apple Bloom 1362
## 2 Applejack 2748
## 3 Big McIntosh 277
## 4 Cheerilee 117
## 5 Cozy Glow 98
## 6 Cranky Doodle Donkey 74
## 7 Cutie Mark Crusaders 103
## 8 Daring Do 67
## 9 Diamond Tiara 150
## 10 Discord 587
## # … with 40 more rows
    Use the get_sentiment() function from the syuzhet library to build sentiment scores for 3 lexicons: ‘syuzhet’, ‘bing’, and ‘nrc’:
library( syuzhet )
ponies_top50$syuzhet <- get_sentiment(ponies_top50$word, method="syuzhet")
ponies_top50$bing <- get_sentiment(ponies_top50$word, method="bing")
ponies_top50$nrc <- get_sentiment(ponies_top50$word, method="nrc")
    Tally the token/word sentiment scores with respect to the character (pony):
#group by pony and summarise the sums of the 3 lexicon scores
ponies_sentimentScores <- ponies_top50 %>%
group_by( pony, count ) %>%
summarise( syuzhetScore = sum( syuzhet ),
bingScore = sum( bing ),
nrcScore = sum( nrc ))
#normalize the scores to account for the number of lines delivered by each character
ponies_sentimentScores <- ponies_sentimentScores %>%
mutate( syuzhetScore = syuzhetScore/count,
bingScore = bingScore/count,
nrcScore = nrcScore/count)
summary( ponies_sentimentScores )
## pony count syuzhetScore
## Apple Bloom : 1 Min. : 60.00 Min. :0.04766
## Applejack : 1 1st Qu.: 91.25 1st Qu.:0.25234
## Big McIntosh : 1 Median : 146.00 Median :0.34897
## Cheerilee : 1 Mean : 628.88 Mean :0.37992
## Cozy Glow : 1 3rd Qu.: 537.00 3rd Qu.:0.46687
## Cranky Doodle Donkey: 1 Max. :4745.00 Max. :1.05417
## (Other) :44
## bingScore nrcScore
## Min. :-0.2266 Min. :-0.09859
## 1st Qu.: 0.2389 1st Qu.: 0.16586
## Median : 0.3031 Median : 0.29200
## Mean : 0.3197 Mean : 0.32874
## 3rd Qu.: 0.3654 3rd Qu.: 0.44584
## Max. : 1.0667 Max. : 0.90598
##
#pivot the data longer to facilitate plotting the distributions of scores by lexicon
plotData <- ponies_sentimentScores %>%
pivot_longer(cols = syuzhetScore:nrcScore, names_to = 'lexicon')
#visualize as box plot:
ggplot(plotData, aes(x=lexicon, y=value)) +
geom_boxplot(color="#8340BF", fill="#BF408B", alpha=0.2,
outlier.colour="#B040BF", outlier.fill="#B040BF", outlier.size=5) +
labs( title = 'Lexicon Scores Compared:', subtitle="distribution of lexicon scores normalized by lines delivered for top 50 characters")
    From the summary statistics and box plot figure above, we see that the median Syuzhet score is higher than that of the other lexicons, the Bing lexicon is the least variable, and the NRC lexicon is the most variable. However, the distributions overlap heavily.
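    As a quick check on the variability claim, the spread of the normalized scores can be tabulated by lexicon (a minimal sketch using the long-format plotData built above):
#standard deviation and interquartile range of the normalized scores per lexicon
plotData %>%
  group_by( lexicon ) %>%
  summarise( score_sd = sd( value ), score_iqr = IQR( value ) )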
    Now rank the line-normalized syuzhet sentiment scores by character to find the most positive ponies:
## Warning in brewer.pal(50, "PuRd"): n too large, allowed maximum for palette PuRd is 9
## Returning the palette you asked for with that many colors
plotData <- ponies_sentimentScores %>%
arrange( desc( syuzhetScore )) %>%
head( n = 15L )
ordered <- plotData$pony
ggplot(plotData, aes(x=pony, y=syuzhetScore, color='black',fill=factor(pony) )) +
geom_bar( stat = 'identity' ) +
scale_x_discrete( limits = rev(ordered)) +
coord_flip() +
scale_fill_manual(values = mycolors ) +
theme(legend.position="none") +
labs( title = 'Highest Syuzhet Score', subtitle = 'Top 15 ranked character Syuzhet Scores normalized by #lines delivered')
    The figure above plots the top 15 syuzhet-scoring characters, with scores normalized by the number of lines delivered. The number of lines was used to normalize the score as a way to qualify how positive each character’s contribution to the narrative was. The result is very interesting. Mayor Mare ranks as the most positive character by this scoring system, which makes sense as she plays a maternal role on the show. However, this ranking also rates a villain, Discord, very highly. Perhaps we will have to probe deeper than single-word token sentiments to develop a more sophisticated ranking of positive roles for characters on the show.
    The following code executes the primary example code for chapter 2 of Text Mining with R. Please see the text for details and a full walk-through of the code.
    This example uses the ‘janeaustenr’ library to explore sentiment analysis with the works of Jane Austen.
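    The three lexicon previews printed below are produced by calling get_sentiments() for each lexicon, as in the book:
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")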
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
tidytext::unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # … with 293 more rows
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
### Comparing the Three Sentiment Dictionaries
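    The output below previews the pride_prejudice data frame which, as in the book, is the subset of tidy_books for that novel; the defining call is reproduced here:
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
head( pride_prejudice )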
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
    The ratio of negative to positive words is higher in the Bing lexicon than in the NRC lexicon; this accounts for the differences in the plots above.
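    The counts behind that comparison come from tallying the negative and positive entries in each lexicon; the calls, as in the book, are reproduced here:
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
get_sentiments("bing") %>%
  count(sentiment)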
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
head(bing_word_counts)
## # A tibble: 6 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
    Custom stop words:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
head(custom_stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
### Looking at units beyond just words
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
#summarise number of chapters in each book
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
    What is the most negative chapter?
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343