Introduction

In this assignment we were asked to walk through an example sentiment analysis from Text Mining with R: A Tidy Approach by Julia Silge and David Robinson, 2020-03-07 edition (https://www.tidytextmining.com/index.html). We were then instructed to run a similar analysis on a different corpus and to incorporate at least one additional sentiment lexicon.

Example Code

library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

head(tidy_books)
## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen

The authors begin with the text returned by austen_books() from the janeaustenr package, a collection of Jane Austen's published novels. They then group the text by book, record line and chapter numbers (using cumsum() over a regular expression that detects chapter headings), and break each line into individual tokens, in this case words.
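
The chapter-numbering step is worth a closer look. Below is a minimal sketch (using a made-up four-line text rather than the real novels) of how str_detect() flags the lines that open a chapter and cumsum() turns those flags into a running chapter counter; the packages are the same ones loaded above:

library(dplyr)
library(stringr)

tibble(text = c("CHAPTER 1", "It is a truth universally acknowledged",
                "CHAPTER II", "that a single man in possession")) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

Here the two heading lines increment the counter, so every line is labeled with the chapter it belongs to.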

library(textdata)
## Warning: package 'textdata' was built under R version 3.6.3
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

Using the words associated with joy from the NRC lexicon (published in Saif M. Mohammad and Peter Turney, 2013, "Crowdsourcing a Word-Emotion Association Lexicon," Computational Intelligence, 29(3): 436-465), the authors are then able to see which joy-related words appear most often in Austen's novel Emma.
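
For reference, NRC tags each word with one or more of ten categories: eight emotions plus overall positive and negative. A quick way to see them (assuming the lexicon has already been downloaded via textdata, as above):

get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)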

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

For the next analysis, the authors wanted to track the change in sentiment across several of Austen's novels using the "bing" sentiment lexicon. To accomplish this, they broke each novel into 80-line segments and measured the aggregate sentiment (positive minus negative word counts) in each segment. Plotting sentiment on the y-axis and the segment index on the x-axis, they were able to trace the change in sentiment over the course of each novel.
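
The 80-line segments come from the integer division linenumber %/% 80 inside the count() call above. A toy illustration of how that expression assigns a segment index:

tibble(linenumber = c(1, 79, 80, 159, 160)) %>%
  mutate(index = linenumber %/% 80)

Lines 1 through 79 land in index 0, lines 80 through 159 in index 1, and so on, so each bar in the plot summarizes roughly 80 lines of text.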

Student Analysis - bing

To contrast the above example, I wanted to select a more modern corpus. I am a fan of the NBC show "The Office" (having been on winning teams at more than one "The Office" trivia night), and thought it would be an interesting subject to explore. I was delighted to find an existing R package, schrute, containing the scripts for the entire series.

library(schrute)
## Warning: package 'schrute' was built under R version 3.6.3
The_Office <- schrute::theoffice %>%
  select(Season = season, Episode = episode, text) %>%   # keep only the columns we need
  unnest_tokens(word, text)                               # one row per word of dialogue

In the Jane Austen analysis, the authors broke each novel into 80-line sections. For a television series, I thought it would be more appropriate to measure the relative sentiment of each episode and track how it changes over the course of each season.

The_Office_Table <- The_Office %>% 
    inner_join(get_sentiments("bing")) %>%
    count(Season, Episode, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)
## Joining, by = "word"
head(The_Office_Table)
## # A tibble: 6 x 5
##   Season Episode negative positive sentiment
##    <int>   <int>    <dbl>    <dbl>     <dbl>
## 1      1       1       74      107        33
## 2      1       2       63      153        90
## 3      1       3       67      128        61
## 4      1       4       65      163        98
## 5      1       5       57      127        70
## 6      1       6       51      173       122
ggplot(The_Office_Table, aes(Episode, sentiment, fill = Season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Season, ncol = 3, scales = "free_x")

As you can see above, a few interesting findings stand out right away. The first is that our underlying data set appears to be incomplete: after spot-checking a few values, the episodes that seem to have zero sentiment are actually missing from the The_Office_Table tibble altogether. The next observation is that no episode has a negative net sentiment. While The Office is known for its awkward moments, at the end of the day it is a situational comedy, so this is to be expected. I would have anticipated that the first season would score considerably lower in sentiment due to its more awkward tone (the first season hewed more closely to the BBC series of the same name and is known to be less uplifting than later seasons). Finally, it is notable how many season finales have a very high sentiment score.
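
One quick way to check the missing-episode suspicion is sketched below (it works against The_Office_Table as built above and assumes each season's episodes are numbered consecutively): compare how many episodes of each season appear in the table with the highest episode number present.

The_Office_Table %>%
  group_by(Season) %>%
  summarise(episodes_present = n(), highest_episode = max(Episode)) %>%
  filter(episodes_present < highest_episode)

Any season returned by this filter has gaps in its episode numbering, which appear as the zero-height bars in the plot.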

Student Analysis - nrc

One of the things fans of the show appreciate is how it leverages awkward moments to build tension before releasing it in either comedy or happiness. For my next analysis, I wanted to track the use of the joy and disgust sentiments as defined by the NRC lexicon:

nrc_disgust <- get_sentiments("nrc") %>% 
  filter(sentiment == "disgust")

The_Office_disgust <- The_Office %>% 
    inner_join(nrc_disgust) %>%
    count(Season, Episode, sentiment) %>%
    spread(sentiment, n, fill = 0)
## Joining, by = "word"
head(The_Office_disgust)
## # A tibble: 6 x 3
##   Season Episode disgust
##    <int>   <int>   <dbl>
## 1      1       1      20
## 2      1       2      22
## 3      1       3      31
## 4      1       4      20
## 5      1       5      23
## 6      1       6      20

In our last analysis, using the bing lexicon, we had a natural mechanism for putting episodes on a common scale: by subtracting negative word counts from positive word counts, we looked at net sentiment. In this analysis, however, we are merely counting words that conjure disgust, and episodes with more dialog would produce higher counts regardless of tone. A better approach is therefore to look at the percentage of words in each episode that conjure disgust.

I elected to go back to our original The_Office data set, count the total number of words per episode, and then divide the number of disgust words by the total dialog in the episode.

The_Office_N <- The_Office %>% count(Season,Episode)

The_Office_disgust <- inner_join(The_Office_disgust, The_Office_N)
## Joining, by = c("Season", "Episode")
The_Office_disgust$disgustpercent <- The_Office_disgust$disgust/The_Office_disgust$n
The_Office_disgust
## # A tibble: 186 x 5
##    Season Episode disgust     n disgustpercent
##     <int>   <int>   <dbl> <int>          <dbl>
##  1      1       1      20  2777        0.00720
##  2      1       2      22  2828        0.00778
##  3      1       3      31  2783        0.0111 
##  4      1       4      20  2955        0.00677
##  5      1       5      23  2481        0.00927
##  6      1       6      20  3047        0.00656
##  7      2       1      20  2835        0.00705
##  8      2       2      36  3074        0.0117 
##  9      2       3      12  2611        0.00460
## 10      2       4      25  2759        0.00906
## # ... with 176 more rows
ggplot(The_Office_disgust, aes(Episode, disgustpercent, fill = Season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Season, ncol = 3, scales = "free_x")

What stands out above is how the show tends to have several episodes that lean more heavily on this tactic of building anxiety or other negative emotions before inevitably releasing the tension. The season 5 premiere stands out as an outlier, but there are several other episodes that tower above the rest on this metric.

The_Office_joy <- The_Office %>% 
    inner_join(nrc_joy) %>%
    count(Season, Episode, sentiment) %>%
    spread(sentiment, n, fill = 0)
## Joining, by = "word"
The_Office_joy <- inner_join(The_Office_joy, The_Office_N)
## Joining, by = c("Season", "Episode")
The_Office_joy$joypercent <- The_Office_joy$joy/The_Office_joy$n

ggplot(The_Office_joy, aes(Episode, joypercent, fill = Season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Season, ncol = 3, scales = "free_x")

Regarding joy, it is interesting that this does not mirror the "bing" sentiment analysis we performed in our first example. That being said, we see some particularly joyous episodes early in the show's tenure. S1E4, S2E19, and S2E23 (the season finale) skew our results fairly substantially; the peaks in later seasons are relatively subdued by comparison.

Student Analysis - AFINN

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows

As AFINN assigns each word an integer value from -5 (most negative) to +5 (most positive), we can sum these values per episode to obtain an equivalent sentiment score for the show.

The_Office_afinn <- The_Office %>% 
    inner_join(get_sentiments("afinn")) %>%
      group_by(Season, Episode) %>% 
      summarise(value = sum(value))
## Joining, by = "word"
head(The_Office_afinn)
## # A tibble: 6 x 3
## # Groups:   Season [1]
##   Season Episode value
##    <int>   <int> <dbl>
## 1      1       1   108
## 2      1       2   191
## 3      1       3   136
## 4      1       4   232
## 5      1       5   116
## 6      1       6   243
ggplot(The_Office_afinn, aes(Episode, value, fill = Season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Season, ncol = 3, scales = "free_x")

As expected, this looks similar, but not identical, to our "bing" sentiment analysis above. The difference most likely comes down to weighting: bing simply classifies each word as positive or negative, while AFINN weights each word by its intensity.
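
One rough way to see that weighting difference (a sketch comparing the two lexicons directly, rather than the show) is to join bing and AFINN on word and look at how the AFINN values distribute within bing's positive and negative classes:

inner_join(get_sentiments("bing"), get_sentiments("afinn"), by = "word") %>%
  group_by(sentiment) %>%
  summarise(mean_value = mean(value), words = n())

In general, words that bing labels positive carry positive AFINN values and vice versa, but the magnitudes range from 1 to 5, which is one source of the differences between the two plots.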

Citation

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017. https://www.tidytextmining.com/index.html

Mohammad, Saif M., and Peter Turney. "Crowdsourcing a Word-Emotion Association Lexicon." Computational Intelligence, 29(3): 436-465, 2013.