Introduction

Free-form text, such as open-ended survey responses or social media posts, must be analyzed differently from traditional row-and-column data. Consumer feedback, survey answers, and social media posts hold a great deal of information, but this data is rarely clean and does not lend itself to traditional methods. The question, then, is how to analyze it.

We first clean the text data. Basic text cleaning was covered in the tidy text tutorial; we build on that knowledge and use the cleaned data from that tutorial.

Next, we need to understand the emotion expressed in the text. This is referred to as sentiment analysis.

Replication

We pick up where we left off in the tidy text tutorial.

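library(tidyverse)      # data manipulation and plotting
library(stringr)        # string handling and text cleaning
library(tidytext)       # tidy text mining and sentiment lexicons
library(harrypotter)    # text of the seven novels (philosophers_stone, ...)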

Sentiment Data Sets

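A variety of dictionaries exist for evaluating the opinion or emotion in text. The tidytext package gives access to three sentiment dictionaries (AFINN, bing, and nrc) through get_sentiments(). All three are single-word dictionaries. As the output below shows, the sentiments dataset itself holds the bing lexicon; in recent versions of tidytext, AFINN and nrc are downloaded on first use through the textdata package. A search for each dictionary (lexicon) will help you understand how it was created.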

head(sentiments)
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative

To see the individual lexicons, try the following code:

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,865 more rows
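
To make the difference between the lexicon formats concrete, here is a minimal sketch that scores one sentence against two tiny hand-made lexicons. The words, scores, and labels in toy_afinn and toy_bing are invented for illustration and are not taken from the real lexicons; only the column layout (word/value and word/sentiment) mirrors the output above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical mini-lexicons that mimic the AFINN and bing column layouts
toy_afinn <- tibble(word = c("good", "great", "bad"),
                    value = c(3, 3, -3))
toy_bing  <- tibble(word = c("good", "great", "bad"),
                    sentiment = c("positive", "positive", "negative"))

# One row per word, as in the tidy text tutorial
tokens <- tibble(text = "The food was good but the service was bad, really bad") %>%
  unnest_tokens(word, text)

# AFINN-style lexicons give a numeric score that can be summed
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# bing-style lexicons give labels that can be counted
bing_counts <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)
```

In this toy case the numeric score sums to -3 (one "good", two "bad"), while the label counts report two negative words and one positive word. Words that are missing from a lexicon, such as "food", contribute nothing.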

Sentiment Analysis

To perform sentiment analysis, we need our data in a tidy format. We walked through tidying the Harry Potter data in the tidy text tutorial; the code is repeated here.

titles <- c("Philosopher's Stone", 
            "Chamber of Secrets", 
            "Prisoner of Azkaban",
            "Goblet of Fire", 
            "Order of the Phoenix", 
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone, 
              chamber_of_secrets, 
              prisoner_of_azkaban,
              goblet_of_fire, 
              order_of_the_phoenix, 
              half_blood_prince,
              deathly_hallows)
  
series <- tibble()

# Tokenize each book into one word per row and stack the results
for (i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())

  series <- rbind(series, clean)
}

series$book <- factor(series$book, levels = rev(titles))

head(series)
## # A tibble: 6 x 3
##   book                chapter word 
##   <fct>                 <int> <chr>
## 1 Philosopher's Stone       1 the  
## 2 Philosopher's Stone       1 boy  
## 3 Philosopher's Stone       1 who  
## 4 Philosopher's Stone       1 lived
## 5 Philosopher's Stone       1 mr   
## 6 Philosopher's Stone       1 and

Now let's use the nrc sentiment data set to assess the different sentiments represented across the Harry Potter series. The counts below show a stronger negative presence than positive.

series %>%
  # right_join keeps every nrc row, including words that never occur in the books
  right_join(get_sentiments("nrc"), by = "word") %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 negative     55093
##  2 positive     37758
##  3 sadness      34878
##  4 anger        32743
##  5 trust        23154
##  6 fear         21536
##  7 anticipation 20625
##  8 joy          13800
##  9 disgust      12861
## 10 surprise     12817
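
The choice of join affects these counts. A right join keeps every lexicon row, so a lexicon word that never appears in the text still adds one to its sentiment, whereas an inner join counts only words found in both tables. The following sketch illustrates the difference with invented data; toy_words and toy_lex are hypothetical and do not come from the books or from nrc.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and lexicon, invented for illustration
toy_words <- tibble(word = c("happy", "happy", "owl", "sad"))
toy_lex   <- tibble(word = c("happy", "sad", "gloomy"),
                    sentiment = c("joy", "sadness", "sadness"))

# right_join keeps "gloomy" even though it never occurs in the text
right_counts <- toy_words %>%
  right_join(toy_lex, by = "word") %>%
  count(sentiment)

# inner_join counts only words that occur in both tables
inner_counts <- toy_words %>%
  inner_join(toy_lex, by = "word") %>%
  count(sentiment)
```

Here the right join reports two "sadness" rows (one of them the unused word "gloomy"), while the inner join reports one. On a large corpus the effect is small relative to the totals, but an inner join may be the safer default when the goal is to count words that actually occur.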

This table gives an idea of the overall sentiment in these books. However, sentiment tends to change as a book progresses.

A printed book typically holds roughly 250 words per page, so we group each book into blocks of 1,250 words, or roughly five pages. For each block, we want to determine whether the text is mostly positive or mostly negative. We split the results by book.
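
The blocking step relies on integer division (%/%) of a running word count. As a small illustration, the sketch below uses a block size of 4 instead of 1250 so the result is easy to read; block_size and index_even are names introduced here for illustration only.

```r
library(dplyr)
library(tibble)

block_size <- 4  # small stand-in for the 1250 used in the text

toy <- tibble(word_count = 1:10) %>%
  mutate(
    # same formula as in the analysis: the first block is one word short
    index = word_count %/% block_size + 1,
    # subtracting 1 first gives blocks of exactly block_size words
    index_even = (word_count - 1) %/% block_size + 1
  )
```

With word_count %/% block_size + 1, words 1 to 3 fall in block 1 and words 4 to 7 in block 2, so the first block holds one word fewer than the rest. At a block size of 1250 that difference is negligible, which is why the simpler formula is usually acceptable.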

series %>%
  group_by(book) %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 1250 + 1) %>% 
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative,
         book = factor(book, levels = titles)) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")

This shows us how the sentiment of each novel changes as the story progresses.
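
The net sentiment for each block is the number of positive words minus the number of negative words. The sketch below reproduces that reshaping step on invented counts, using tidyr's pivot_wider(), which the tidyr documentation recommends over the superseded spread(). The toy_counts values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical per-block counts, shaped like the output of count()
toy_counts <- tibble(
  book      = c("A", "A", "A", "B"),
  index     = c(1, 1, 2, 1),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(5, 8, 3, 2)
)

net <- toy_counts %>%
  # one column per sentiment; blocks with no words of a kind get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
```

Filling missing combinations with 0 matters: block 2 of book "A" has no negative words, and without the fill its net sentiment would be NA rather than 3.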

Sentiments for a Business Review

One reason a business may want to conduct this type of analysis is to determine how customers feel about its product. A simple way to do this is to count how often each sentiment-bearing word appears.

bing_word_counts <- series %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 3,313 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 like   positive   2416
##  2 well   positive   1969
##  3 right  positive   1643
##  4 good   positive   1065
##  5 dark   negative   1034
##  6 great  positive    877
##  7 death  negative    757
##  8 magic  positive    606
##  9 better positive    533
## 10 enough positive    509
## # ... with 3,303 more rows

We can view this visually to assess the top words for each sentiment:

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ggplot(aes(reorder(word, n), n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()
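
Selecting the top words within each sentiment is a grouped "top n" operation. The sketch below shows the same idea on invented counts using slice_max(), which recent dplyr versions offer as the successor to top_n(). The toy_word_counts values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical word counts, shaped like bing_word_counts
toy_word_counts <- tibble(
  word      = c("good", "great", "well", "dark", "death", "bad"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  n         = c(50, 30, 10, 40, 20, 5)
)

# Keep the two most frequent words within each sentiment
top_words <- toy_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2) %>%
  ungroup()
```

Because the selection happens within each group, a frequent positive word cannot crowd out the negative panel. Note that ties are kept by default, so a group can return more rows than requested.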

Another way to compare sentiments is to plot positive and negative contributions on the same graph, with the counts for negative words shown as negative values.

bing_word_counts %>%
  filter(n > 300) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill=sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to Sentiment")

Citations

“AFIT Data Science Lab R Programming Guide.” Accessed August 3, 2021.

“NRC Emotion Lexicon.” Accessed August 6, 2021.

“Introduction to Tidytext.” Accessed August 10, 2021.

Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach, 2017.

“Text Mining: Creating Tidy Text · UC Business Analytics R Programming Guide.” Accessed August 3, 2021.