Recall that in LOGM 682 Module 5 we discussed the basics of text mining and tidy text principles. To further this discussion, we will now take a look at sentiment analysis. Sentiment analysis is the process of extracting an author’s opinion or feeling from a written text, and is often called “opinion mining.” In the most basic sense, this means categorizing a portion of text as either a positive or negative opinion. This can be accomplished at several levels within a text, including individual words, sentences, and paragraphs [1]. This approach assumes that within your level of analysis (word, sentence, etc.), the opinion that is analyzed applies to only one entity or topic.
Many documents include multiple opinions, both positive and negative, throughout the text. If the basic approach is applied to this scenario, calculating a sentiment “score” could result in a wash, effectively neutralizing the sentiment of the document. For this reason, handling multiple opinions across a single document becomes more complex. This analysis involves the extraction of explicit and implicit meaning from the text. While this is a practical and useful approach, we must first learn to walk before we run. Thus, the focus of this tutorial will be on the basics of sentiment analysis.
Although text mining is a fairly new analytic technique (established in the early 1990s), it is widely used by businesses across the world. Today, we see the popularity of internet shopping increasing. As such, reviews of almost any product or business exist and are monitored by the respective businesses. One way this is being accomplished is through sentiment analysis.
For example, let’s take a look at this review on Amazon:
GSI Outdoors Glacier Stainless Minimalist Pot
This review has information, both positive and negative, that the company would be interested in. The customer rated this product “great” overall with four out of five stars. While they say the lid stays on extremely well, it is a double-edged sword, and you can’t seem to get it off when you need to because the seal sticks. They also have negative comments about the spork saying it is useless, and implicitly say that the pot gripper is also useless.
While that short breakdown of the Amazon review was done manually, we effectively did a sentiment analysis. We determined what was good, and what was bad, and could have easily assigned a sentiment score to quantify our conclusion. Manual analysis, however, would not be a feasible approach when dealing with hundreds or thousands of reviews. Thankfully, there are ways that the R programming language can help us automate this process. In our basic analysis, we will be able to search the text for words that contribute to our sentiment score. We’ll talk more about how to do this in a later section.
At this point, you might be asking yourself, “How does this apply to the DoD?” It is a fair question, considering the DoD doesn’t use product reviews to make purchase decisions, at least not in large-scale acquisitions. The DoD can’t Google a review of how well a new weapon system works the way you or I could when deciding to purchase a new car. What the DoD can do is apply sentiment analysis (and other text mining techniques) to analyze how senior leaders or experts feel about something.
For example, in the mainstream media, you have probably heard at some point that a particular acquisition program is either over budget or behind on schedule. Congress then decides that this is a problem, and the acquisition process (or at least aspects of it) is broken. Their answer – Reform! In fact, reform has been the go-to answer for over 50 years. But despite all the past reform efforts, it seems like the story always stays the same and cost and schedule growth remain a problem.
Why is this? Well, dozens of senior leaders and experts in the acquisition field have published their opinions on the matter. A sentiment analysis can be used to examine these publications to identify any existing trends, and possibly help determine why acquisition reform hasn’t seemed to work.
To replicate the analysis performed in this tutorial, the following packages are required:
library(tidyverse)
library(tidytext)
You will also need to download the RDS datafile, data_tb.rds, located here.
data_tb is a tibble containing the opinions of two leading experts in the defense acquisition field, the Honorable Dr. Jamie Morin and the Honorable Frank Kendall III, on acquisition reform. These opinions are a subset of those published in a staff report of the US Senate Permanent Subcommittee on Investigations, Committee on Homeland Security and Governmental Affairs, titled DEFENSE ACQUISITION REFORM: WHERE DO WE GO FROM HERE? A Compendium of Views by Leading Experts. The full staff report can be obtained here.
For your convenience, and to keep the focus on sentiment analysis, the data has already been cleaned and formatted. For a more in-depth look at how this was done, you can download the script Clean.R; however, it is not required to complete the examples in this tutorial.
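A minimal sketch of the loading step, assuming data_tb.rds has been saved to your working directory (the path is an assumption; adjust it to wherever you saved the file). So that the sketch runs without the download, it first writes a hypothetical one-row stand-in to a temporary file. The values, and the choice to store DATE as a Date, are placeholders and not the real data.

```r
library(tibble)

# Hypothetical stand-in with the same four column names as data_tb
# (placeholder values only; the real column types may differ).
toy_tb <- tibble(
  DOCUMENT = "Example report",
  DATE     = as.Date("2000-01-01"),
  NAME     = "Doe, Jane",
  TXT      = "Reform is hard. Progress is possible."
)

# Round-trip through an .rds file, the same format as data_tb.rds
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_tb, rds_path)

# With the real file this would be: data_tb <- readRDS("data_tb.rds")
data_demo <- readRDS(rds_path)
names(data_demo)
```

readRDS() restores a single R object, so the result has to be assigned to a name before it can be used.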
data_tb contains four variables:
DOCUMENT: Name of the document where the expert’s opinion was published
DATE: Date that the document was published
NAME: Author of the opinion (Last, First)
TXT: Text of the author’s opinion
To perform a basic sentiment analysis, the first step is to split, or unnest, the document text into the appropriate level. For our first analysis, we’ll work at the individual word level.
word_tb <- data_tb %>%
  unnest_tokens(word, TXT)
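To see what this step does, here is a small sketch on a hypothetical two-row tibble (the names and text are invented for illustration). With its default settings, unnest_tokens() returns one row per word, lowercases the text, strips punctuation, and drops the original text column.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for data_tb with only the columns needed here
toy_tb <- tibble(
  NAME = c("Doe, Jane", "Roe, Richard"),
  TXT  = c("Reform is hard.", "Progress IS possible!")
)

# One row per word; NAME is carried along for every token
toy_words <- toy_tb %>%
  unnest_tokens(word, TXT)

toy_words
```

Because every token keeps its NAME value, later steps can still group results by author.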
Now that the dataframe is tokenized by individual words, we can begin our sentiment analysis. The tidytext package contains some useful tools to help us. First is the sentiments dataset, which contains three popular lexicons that we can use in our analysis:
afinn: assigns words a score (between -5 and 5), with negative scores indicating negative sentiment
bing: categorizes words into positive and negative categories
nrc: categorizes words into positive and negative categories, as well as by type of sentiment (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust)
The second useful tool that the tidytext package offers is the get_sentiments() function. Let’s use get_sentiments() to take a closer look at the three sentiment lexicons.
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
With an understanding of the three sentiment lexicons, we can now use them within our sentiment analysis to determine a sentiment score, categorize by positive and negative sentiment, or further categorize by sentiment type.
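As a rough sketch of the scoring idea, the example below joins tokenized words to a lexicon and sums the scores per author. Both tibbles are hypothetical: toy_lexicon only imitates the word/score layout of the afinn output shown above, and its scores are invented. Depending on your tidytext version, the real afinn score column may be named value rather than score, so check the column names before summing.

```r
library(dplyr)
library(tibble)

# Words as they might look after unnest_tokens() (hypothetical data)
toy_words <- tibble(
  NAME = c("Doe, Jane", "Doe, Jane", "Doe, Jane", "Roe, Richard", "Roe, Richard"),
  word = c("reform", "failed", "badly", "great", "progress")
)

# Hypothetical afinn-style lexicon; scores are invented for illustration
toy_lexicon <- tibble(
  word  = c("failed", "badly", "great", "progress"),
  score = c(-2L, -3L, 3L, 2L)
)

toy_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%  # words missing from the lexicon drop out
  group_by(NAME) %>%
  summarise(sentiment = sum(score), n_words = n())

toy_scores
```

Note that inner_join() silently discards any word the lexicon does not contain ("reform" here), so a score only reflects the words the lexicon happens to cover.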
Simple word counts are great as a “first look” or during an exploratory analysis, but sentiment usually changes across a document, adding complexity. To visualize this, we can plot sentiment using geom_bar from the ggplot2 package. But first, we need to determine the appropriate level (word, sentence, paragraph) to examine.
Considering the length of the authors’ opinions, let’s take a look at the sentiment of each author across each paragraph within the document. We have a small hiccup to account for, though: the text in data_tb has been cleaned to the extent that paragraph delimiters (newlines) no longer exist, so unnesting the data by paragraph won’t work here. Instead, we can use our word_tb dataframe and approximate each paragraph as a block of about 50 words.
To accomplish this, we will need to create an index to iterate through, and calculate the sentiment across our “paragraphs.” The bing lexicon will be a good choice for us to use here, allowing us to count and sum the number of positive or negative words in each paragraph.
# Approximate look at paragraphs: treat every 50 words as one "paragraph"
word_tb %>%
  group_by(NAME) %>%
  mutate(word_count = 1:n(),
         # subtract 1 so words 1-50 fall in index 1, 51-100 in index 2, ...
         index = (word_count - 1) %/% 50 + 1) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(NAME, index = index, sentiment) %>%
  ungroup() %>%
  # one column per sentiment; chunks missing a category are filled with 0
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment, fill = sentiment > 0)) +
  geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ NAME, ncol = 2, scales = "free_x")
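The 50-word index depends on how integer division (%/%) is applied to a 1-based word counter. A quick check on a toy vector (chunk size 3 instead of 50; all names here are illustrative) compares dividing the counter directly with subtracting 1 first:

```r
word_position <- 1:7
chunk_size <- 3

# Dividing the 1-based position directly leaves the first chunk one word short
idx_plain <- word_position %/% chunk_size + 1

# Subtracting 1 first gives chunks of exactly chunk_size words
idx_even <- (word_position - 1) %/% chunk_size + 1

table(idx_plain)  # chunk sizes: 2, 3, 2
table(idx_even)   # chunk sizes: 3, 3, 1
```

With 50-word chunks the difference is a single word in the first block, so either form is a reasonable approximation of a paragraph.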
Unnest data_tb by sentence.
Can you figure out how to recreate this heatmap showing how the sentiment of each sentence changes throughout the progression of each author’s opinion?
HINTS:
The afinn lexicon will be useful
ggplot code to create the heatmap:
ggplot(aes(index,
           factor(NAME, levels = sort(unique(NAME), decreasing = TRUE)),
           fill = sentiment)) +
  geom_tile(color = "white") +
  scale_fill_gradient2() +
  scale_x_continuous(labels = scales::percent, expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  labs(x = "Opinion Progression", y = "Acq Expert") +
  ggtitle("Sentiment of Acquisition Expert Opinion",
          subtitle = "Summary of the net sentiment score as the expert's opinion progresses") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")
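One possible starting point for the exercise, shown on hypothetical text rather than data_tb: unnest by sentence with token = "sentences", then express each sentence's position as a fraction of that author's total so the x-axis can be shown as a percentage. The column names sentence and index are choices made for this sketch, and it deliberately stops short of the scoring and plotting steps.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for data_tb
toy_tb <- tibble(
  NAME = c("Doe, Jane", "Roe, Richard"),
  TXT  = c("Reform is hard. Progress is possible. We should try.",
           "Costs keep growing. Schedules slip.")
)

toy_sentences <- toy_tb %>%
  unnest_tokens(sentence, TXT, token = "sentences") %>%  # one row per sentence
  group_by(NAME) %>%
  mutate(index = row_number() / n()) %>%  # progression through each opinion, ending at 1
  ungroup()

toy_sentences
```

Scaling the index within each author lets opinions of different lengths share one 0 to 100 percent axis.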
The material and exercises covered throughout this tutorial were derived from the following sources: