Responses to free-form questions, such as consumer feedback, survey answers, and social media posts, must be analyzed differently from traditional row/column data. These sources contain a great deal of useful information, but the text is rarely clean and does not lend itself to traditional analytical methods, so the question becomes how to analyze it.
The first step is to clean the text data. Basic text cleaning was covered in the tidy text tutorial; we build on that work here and use the cleaned data from that tutorial.
Next, we want to understand the emotion expressed in the text. This is referred to as sentiment analysis.
We pick up where we left off in the tidy text tutorial.
library(tidyverse)   # data manipulation and plotting
library(stringr)     # string helpers
library(tidytext)    # tidy text mining, including get_sentiments()
library(harrypotter) # full text of the seven novels; not on CRAN, typically
                     # installed with devtools::install_github("bradleyboehmke/harrypotter")
There are a variety of dictionaries for evaluating the opinion or emotion in text. The tidytext package provides access to three general-purpose sentiment lexicons (AFINN, Bing, and NRC) through its get_sentiments() function; the sentiments dataset bundled with the package contains the Bing lexicon, which is what head(sentiments) shows below. All three are single-word (unigram) dictionaries. A quick search for each lexicon will help you understand how it was created.
head(sentiments)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
To see the individual lexicons, try the following code. (The AFINN and NRC lexicons are distributed through the textdata package, so you may be prompted to download them the first time.)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
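It can also help to get a feel for what each lexicon contains before joining it to our text. As a quick optional check (not part of the walkthrough above), we can count the words in each NRC category and compare the overall size of the three lexicons:
# Optional: how many words does each NRC category contain?
get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)

# Optional: overall size of each lexicon
nrow(get_sentiments("afinn"))
nrow(get_sentiments("bing"))
nrow(get_sentiments("nrc"))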
To perform sentiment analysis, we need our data in a tidy format. We walk through tidying the Harry Potter data in the tidy text section.
titles <- c("Philosopher's Stone",
            "Chamber of Secrets",
            "Prisoner of Azkaban",
            "Goblet of Fire",
            "Order of the Phoenix",
            "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone,
              chamber_of_secrets,
              prisoner_of_azkaban,
              goblet_of_fire,
              order_of_the_phoenix,
              half_blood_prince,
              deathly_hallows)

# Build one tidy data frame with one row per word, tagged by book and chapter
series <- tibble()

for(i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())

  series <- rbind(series, clean)
}

series$book <- factor(series$book, levels = rev(titles))
head(series)
## # A tibble: 6 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
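Notice that the tidied series still contains common words such as "the" and "and". We do not need to remove them for sentiment analysis, because words that are not in a lexicon simply drop out of the joins below. As a rough optional check, the following sketch estimates what share of tokens match the Bing lexicon:
# Optional check: share of tokens in the series that appear in the Bing lexicon
series %>%
  mutate(in_bing = word %in% get_sentiments("bing")$word) %>%
  summarise(share_matched = mean(in_bing))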
Now let's use the NRC sentiment lexicon to assess the different sentiments represented across the Harry Potter series. We can see that there is a stronger negative presence than positive.
series %>%
  right_join(get_sentiments("nrc")) %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 negative 55093
## 2 positive 37758
## 3 sadness 34878
## 4 anger 32743
## 5 trust 23154
## 6 fear 21536
## 7 anticipation 20625
## 8 joy 13800
## 9 disgust 12861
## 10 surprise 12817
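If we want to see how this balance plays out book by book, a similar optional count, here using an inner join (which gives the same tallies for words that actually appear in the text) and keeping only the positive and negative categories, might look like this:
# Optional sketch: positive vs. negative word counts per book
series %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(book, sentiment) %>%
  spread(sentiment, n)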
Counts like these give an overall picture of the sentiment in the books. However, the sentiment of a book tends to shift as the story progresses.
A book typically contains roughly 250 words per page, so we break each book into blocks of 1,250 words, roughly 5 pages. For each block we determine whether the text is, on balance, positive or negative, and we plot the results separately for each book.
series %>%
  group_by(book) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 1250 + 1) %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = index, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative,
         book = factor(book, levels = titles)) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
This shows us how the sentiment of each novel changes as the story progresses.
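Because the AFINN lexicon scores words on a numeric scale rather than labeling them positive or negative, one optional way to sanity-check this picture is to repeat the calculation with AFINN, summing the scores within the same 1,250-word blocks. A sketch of that idea (assuming the AFINN lexicon downloads via textdata as above):
# Optional sketch: the same trajectory using AFINN's numeric scores
series %>%
  group_by(book) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 1250 + 1) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(book, index) %>%
  summarise(sentiment = sum(value)) %>%
  ungroup() %>%
  mutate(book = factor(book, levels = titles)) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_col(alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")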
One reason a business may conduct this type of analysis is to determine how customers feel about its product. An easy way to start is to count the words that contribute to each sentiment.
bing_word_counts <- series %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts
## # A tibble: 3,313 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 like positive 2416
## 2 well positive 1969
## 3 right positive 1643
## 4 good positive 1065
## 5 dark negative 1034
## 6 great positive 877
## 7 death negative 757
## 8 magic positive 606
## 9 better positive 533
## 10 enough positive 509
## # ... with 3,303 more rows
We can view this visually to assess the top words for each sentiment:
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ggplot(aes(reorder(word, n), n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()
Another way to compare sentiments is to plot each word's contribution, with positive and negative words on the same graph.
bing_word_counts %>%
  filter(n > 300) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to Sentiment")
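One caveat: the word counts above show that "magic" is tagged as positive by the Bing lexicon, even though in these books it is largely a neutral, domain-specific term. In a customer-feedback setting we would typically maintain a small custom list of such words and drop them before counting. A sketch of that idea (custom_exclusions is a hypothetical list, not part of the tutorial's data):
# Optional sketch: drop words whose lexicon label is misleading in this context.
# custom_exclusions is a hypothetical list; extend it to suit your own data.
custom_exclusions <- tibble(word = c("magic"))

series %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  anti_join(custom_exclusions, by = "word") %>%
  count(word, sentiment, sort = TRUE)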
“AFIT Data Science Lab R Programming Guide.” Accessed August 3, 2021. Available here.
“NRC Emotion Lexicon.” Accessed August 6, 2021. Available here.
“Introduction to Tidytext.” Accessed August 10, 2021. Available here.
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach, 2017. Available here.
“Text Mining: Creating Tidy Text · UC Business Analytics R Programming Guide.” Accessed August 3, 2021. Available here.