STA 279 Lab 5

Complete all Questions.

The Goal

We have been learning about the foundations of sentiment analysis. The goal for today is to apply what we have learned and to dig a little deeper into what we can do with sentiment analysis.

The Data

We are provided with n = 2499 Amazon reviews from a certain company that sells products on Amazon. All of these reviews are for a wireless mouse that has LED lights that allow the user to change the color of the mouse as desired. To load the data, use the following:

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)

# Load the data
AmazonReviews <- read.csv("https://www.dropbox.com/scl/fi/61kcmzazw531qcwv28j52/AmazonReviews.csv?rlkey=krdesfg9luff2h5ajqph6q1bf&st=4d0f9put&dl=1")

# Convert to a data frame
AmazonReviews <- data.frame(AmazonReviews)

The data set has \(n=2499\) rows and 2 columns.

cleaned_review: This is the written review.
ID: an ID number for the review.

The company serving as our client for today has two goals. They want to use the reviews on their wireless mouse to (1) decide if any changes should be made to their product and (2) find out whether there are particular things customers like about the product, so the company can make sure to keep these features going forward.

This means they would like to proceed in three steps.

Step 1: Figure out which reviews are positive and which reviews are negative.
Step 2: Look at the positive reviews and figure out what people like about the mouse.
Step 3: Look at the negative reviews and figure out what people do NOT like about the mouse so the client can make changes.

Right now, we do not have any way to determine which reviews are positive and which are negative, so we will need to spend some time on Step 1.

One Review

Recall that the first step is to sort the reviews into two categories: positive or negative. To help us see how we might do this, let’s start by considering just one review.

Question 1

Write and run the code needed to tokenize the first review (only the first!!) into words.

We want to classify the review as positive or negative, so we want to look at the words and determine which might contribute to the sentiment of the review. In other words, we want to find words that are positive or negative.

Question 2

Identify any positive words in the first review. If there are none, just write “none”.
Identify any negative words in the first review. If there are none, just write “none”.

As we discussed in class, doing this process manually can be subjective, meaning each of us in the class could get different answers to Question 2 depending on what we think are positive or negative words. This is not ideal for data analysis. Instead, we want to use lexicons to help us label words as positive or negative.

Question 3

What is the definition of a lexicon?

There are a few different lexicons we could use, but let’s start with the Bing lexicon. We can load this lexicon using the code below:

bing <- get_sentiments("bing")

Question 4

How many words are in the Bing lexicon?

Hint: You should not need any code to figure this out.

Question 5

State (a) one positive word and (b) one negative word in the Bing lexicon.

Note: There are many options here, just choose one!

Once we have chosen a lexicon, the next step is to look at our first review and find out which words are also in the Bing lexicon. This means we can let the lexicon decide which words are positive and negative without us having to decide manually.

To figure out which words in the first review are also in Bing, we will use the function inner_join.

AmazonReviews[1,] |>

    unnest_tokens(word, cleaned_review) |>
  
    inner_join(get_sentiments("bing"))

We have seen anti_join before, which means “remove” in R. The function inner_join means “keep”. This means that inner_join( get_sentiments("bing") ) means “keep only words in the text that are in the Bing lexicon”.
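To see the difference between the two joins on something small, here is a minimal sketch. The objects toy_words and toy_lexicon are made up for illustration; they are not the lab data or the real Bing lexicon.

```r
library(dplyr)

# Hypothetical tokenized review: one word per row
toy_words <- data.frame(word = c("great", "mouse", "broken"))

# Hypothetical mini-lexicon (NOT the real Bing lexicon)
toy_lexicon <- data.frame(
  word      = c("great", "broken"),
  sentiment = c("positive", "negative")
)

# inner_join keeps only the rows whose word appears in the lexicon
kept <- inner_join(toy_words, toy_lexicon, by = "word")

# anti_join removes the rows whose word appears in the lexicon
removed <- anti_join(toy_words, toy_lexicon, by = "word")

kept
removed
```

Here kept holds "great" and "broken" along with a new sentiment column, while removed holds only "mouse".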

Question 6

Which words in the first review are also in the Bing lexicon?

Bing is just one lexicon we could have used to determine which words in the review contribute to the sentiment. If we changed lexicons, would we maybe identify different words as being positive or negative??

To check this, let’s consider another lexicon: AFINN.

Question 7

Which words in the first review are also in the AFINN lexicon?

Hint: This requires changing only one word in the code above Question 6.

This all means that once you choose a lexicon, it is important to note this in your analysis. The sentiment we assign to words depends on the selection of lexicon, and there isn’t really one “right” lexicon to use. We just need to be clear about which one we settled on!

Once we have chosen a lexicon, and we have identified which words in a text are positive or negative, the next step is to construct a sentiment score that expresses how positive or negative a text is. This is what we will use to classify a review as positive or negative as the client asked.

Question 8

What total sentiment score would you assign the first review based on the Bing lexicon?
What total sentiment score would you assign the first review based on the AFINN lexicon?

Once we have a sentiment score, we can classify a text as positive or negative. Generally, texts with scores above 0 are classified as positive and texts with scores below 0 are classified as negative.

Question 9

Do we classify review 1 as positive or negative? Does this classification change depending on which lexicon we use?

For today, we are just looking to see if a review is overall positive or overall negative. However, if we want to measure how positive or how negative, generally I lean towards using AFINN because it gives different positive or negative words different weights. Again, there is no “right” choice, but make sure you are clear about which you pick!
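To make the difference between the two kinds of score concrete, here is a small worked sketch. The words, labels, and weights below are made up for illustration (they are not looked up in either lexicon); the only thing taken from AFINN is the idea that each word carries a numeric value.

```r
# Hypothetical matched words for one review (toy values, for illustration only)
toy_bing <- data.frame(
  word      = c("love", "great", "cheap"),
  sentiment = c("positive", "positive", "negative")
)
toy_afinn <- data.frame(
  word  = c("love", "great", "cheap"),
  value = c(3, 3, -1)   # made-up weights on an AFINN-style numeric scale
)

# Bing-style score: number of positive words minus number of negative words
bing_score <- sum(toy_bing$sentiment == "positive") -
  sum(toy_bing$sentiment == "negative")

# AFINN-style score: add up the numeric values of the matched words
afinn_score <- sum(toy_afinn$value)

bing_score
afinn_score
```

Both scores are above 0 here, so both would classify this toy review as positive, but the weighted score also reflects that "love" and "great" count for more than "cheap".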

All the reviews

Now that we have seen how we can classify review 1 as positive or negative using sentiment analysis, let’s see if we can label ALL the reviews.

Remember that the first step with review 1 was to keep only the words in review 1 that the Bing lexicon identified as positive or negative. That means that we now need to do this for ALL the reviews.

If we want to keep only the words from ALL the reviews that are in Bing, we can use:

tidy_withBing <- AmazonReviews |>

  unnest_tokens(word, cleaned_review) |>

  inner_join(get_sentiments("bing"))

Side Note: Warnings

NOTE: When you run commands like inner_join or other merge commands on larger data sets, you may see a warning on your screen:

Warning: Detected an unexpected many-to-many relationship between x and y.

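It looks intimidating because it’s red, but you can ignore it! It just means that some words match more than one row on both sides of the join: the same word can appear in many reviews, and a handful of words (such as “envious”) are listed in Bing more than once, as both positive and negative. This is not a problem! If you want to hide this output, you can change your code to: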

tidy_withBing <- AmazonReviews |>

  unnest_tokens(word, cleaned_review) |>

  inner_join(get_sentiments("bing"), relationship = "many-to-many")

This tells R that we expect a word to match more than one row in both the reviews and the lexicon, and the warning will go away.
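If you are curious what a many-to-many match looks like, here is a minimal sketch. Both data frames are made up for illustration: the same word appears in two reviews, and the toy lexicon lists that word twice with different labels.

```r
library(dplyr)

# Hypothetical tokens: the same word shows up in two different reviews
toy_tokens <- data.frame(
  ID   = c(1, 2),
  word = c("envious", "envious")
)

# Hypothetical lexicon in which one word carries two labels
toy_lexicon <- data.frame(
  word      = c("envious", "envious"),
  sentiment = c("positive", "negative")
)

# Each of the 2 token rows matches 2 lexicon rows, so 4 rows come back.
# Declaring the relationship tells dplyr that this is expected.
joined <- inner_join(toy_tokens, toy_lexicon,
                     by = "word", relationship = "many-to-many")

nrow(joined)
```

Each review ends up with one positive and one negative row for the word, so in a positive-minus-negative score the two rows cancel out.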

Back to what we were doing

After running the code above, we have a data set called tidy_withBing that contains (1) the ID of the review, (2) a word in Bing, and (3) the sentiment of that word. All of the words in this data set were in the Bing lexicon.

At this point, we want to start building our sentiment score. With Bing,

\[\text{Sentiment Score} = \text{Number of Positive Words} - \text{Number of Negative Words}\]

This means we need to count (a) the number of positive words and (b) the number of negative words in each review.

Suppose we start with the positive words. The code we need to count the number of positive words in each review is:

positive_withBing <- tidy_withBing |>
  
  filter(sentiment == "positive") |>
  
  count(ID) |>
  
  rename(n_positive = n)

Question 10

The last line of code rename(n_positive = n) is a new one. Run the code above with and without this line (but do NOT print all of that in the lab!!) to see what this line of code does. As an answer to this question, describe briefly what this code is for.

Question 11

Adapt the code above to count the number of negative words (according to Bing) in each review. As an answer to this question, state the number of negative words in the review with ID 6.

For modeling, it helps to add the negative and positive counts as columns on our AmazonReviews data set. We can do this using:

AmazonReviewsNew <- AmazonReviews |>
  full_join(positive_withBing, by = "ID") |>
  full_join(negative_withBing, by = "ID")

Another join command!! So far, we have:

anti_join: remove only certain rows
inner_join: keep only certain rows
full_join: merge two data sets, keeping all rows

Question 12

If you look at the 1st row in AmazonReviewsNew, you will notice that the last column (n_negative) has an NA in it. Why do you think that is?

Hint: We have already played with this row of data!!

It turns out that these NAs are meaningful: each one stands for a specific value. In this case, an NA really means a 0. To fill in those NAs, we add two more lines of code:

AmazonReviewsNew <- AmazonReviews |>
  full_join(positive_withBing, by = "ID") |>
  full_join(negative_withBing, by = "ID") |>
  mutate(n_positive = replace_na(n_positive,0)) |>
  mutate(n_negative = replace_na(n_negative,0))

The code mutate basically means “change”. We are changing the column n_positive by replacing the NAs (replace_na) with 0s.
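Here is the same pair of steps on a tiny example. The data frames toy_reviews and toy_positive are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical reviews and per-review positive word counts
toy_reviews  <- data.frame(ID = c(1, 2, 3))
toy_positive <- data.frame(ID = c(1, 3), n_positive = c(2, 1))

# Review 2 has no row in toy_positive, so full_join fills in NA for it
toy_joined <- full_join(toy_reviews, toy_positive, by = "ID")

# replace_na swaps each NA for the value we supply (here 0)
toy_filled <- toy_joined |>
  mutate(n_positive = replace_na(n_positive, 0))

toy_joined
toy_filled
```

In toy_joined the count for review 2 is NA; in toy_filled it is 0, while the other two counts are unchanged.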

Question 13

How would you change the code above if you wanted to replace all the NAs with the number 6?

Note: Don’t run this code!! We don’t want all our NAs to be 6.

Now we have a data set called AmazonReviewsNew with 4 columns:

cleaned_review: This is the written review.
ID: an ID number for the review.
n_positive: the number of positive words in the review, according to the Bing lexicon.
n_negative: the number of negative words in the review, according to the Bing lexicon.

The next step is to create a column for the total sentiment score for each review. Remember that the sentiment score we need is:

\[\text{Sentiment Score} = \text{Number of Positive Words} - \text{Number of Negative Words}\]

There are a lot of ways to create the sentiment score using code, but I like a simple one:

AmazonReviewsNew <- AmazonReviewsNew |>
  mutate( bing_sentiment = n_positive - n_negative)
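As one example of another way, here is a sketch that counts positive and negative words in a single step and then spreads them into columns. This is an alternative to the approach used in this lab, not a replacement for it, and toy_matched is a made-up stand-in for the output of the inner_join step.

```r
library(dplyr)
library(tidyr)

# Hypothetical output of the inner_join step: one row per matched word
toy_matched <- data.frame(
  ID        = c(1, 1, 2, 3, 3, 3),
  sentiment = c("positive", "positive", "negative",
                "positive", "negative", "negative")
)

toy_scores <- toy_matched |>
  # Count positive and negative words within each review
  count(ID, sentiment) |>
  # One column per sentiment; missing combinations become 0 instead of NA
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  # Same formula as above: positives minus negatives
  mutate(bing_sentiment = positive - negative)

toy_scores
```

One caveat with this version: a review with no lexicon words at all never appears in toy_matched, so it would still need to be joined back onto the full list of reviews and filled in with 0.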

Question 14

(a) Run the code above to create a Bing sentiment score. As an answer to this question, show the output of:

summary(AmazonReviewsNew$bing_sentiment)

(b) Are more reviews for the mouse positive or negative? How can you tell? Hint: You do not need anything except the output for (a) to answer this!

Classifying Using Sentiment Analysis

Our next step is to sort the 2499 reviews into two piles: negative reviews and positive reviews. We have created a sentiment score for each review. Reviews with positive sentiment scores are labelled as positive reviews, and reviews with negative scores are labelled as negative reviews.

We can do this in R using the following code:

AmazonReviewsNew <- AmazonReviewsNew |>
  mutate( label = ifelse(bing_sentiment > 0, "positive","negative"))

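The code above creates a new column in the AmazonReviewsNew data set called label that labels a review as positive if its Bing sentiment score is above 0 and as negative otherwise. Note that this means a review with a score of exactly 0 is labelled negative.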
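Because ifelse has only two outcomes, a review whose score is exactly 0 lands in the "negative" pile. If you would rather keep those reviews separate, one possible sketch uses case_when to add a third label. The data frame toy_scored is made up for illustration, and the rest of this lab assumes the two-label version.

```r
library(dplyr)

# Hypothetical scores, including a review that scores exactly 0
toy_scored <- data.frame(ID = 1:3, bing_sentiment = c(2, 0, -1))

toy_labelled <- toy_scored |>
  mutate(label = case_when(
    bing_sentiment > 0 ~ "positive",
    bing_sentiment < 0 ~ "negative",
    TRUE               ~ "neutral"   # score of exactly 0
  ))

toy_labelled
```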

Question 15

How many reviews are classified as positive? Negative?

And there we have it! We have used sentiment analysis to classify each of the over 2000 reviews as positive or negative, without having to read all the reviews!! This is extremely useful in practice!!

Using the Classifications

Now that we have classified reviews into positive or negative, our client wants us to find the bi-grams that most distinguish positive reviews and negative reviews. They are hoping this will give them information on (1) how they can improve their product or (2) what users like about their products.

We haven’t done bi-grams yet, but we will get our first taste of them today! To tokenize the reviews into bi-grams, we use code that is very similar to the code for tokenizing into words:

tidy_bigrams <- AmazonReviewsNew |>
  unnest_tokens(bigram, cleaned_review, token = "ngrams", n = 2)

Removing stop words is a little more complicated for bi-grams than it is for words, however. Bi-grams might have stop words at the beginning of the bi-gram (“and remember”) or at the end (“giving it”). This means that when we remove stop words with bi-grams, we need to consider both of these scenarios and remove any bi-grams with a stop word at the beginning or the end of the bi-gram.

We can do this using the code below:

tidy_bigrams_noStop <- AmazonReviewsNew |>
  unnest_tokens(bigram, cleaned_review, token = "ngrams", n = 2) |> 
  separate(bigram, c("word1", "word2"), sep = " ") |> 
  # Remove any rows where word 1 is a stop word
  filter(!word1 %in% stop_words$word) |>
  # Remove any rows where word 2 is a stop word
  filter(!word2 %in% stop_words$word) |> 
  unite(bigram, word1, word2, sep = " ")
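To watch each step do its job, here is the same pipeline on a single made-up sentence. Both toy_review and the two-word stop list toy_stop are hypothetical; the real code above uses the much longer stop_words list.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-review data set and a tiny made-up stop word list
toy_review <- data.frame(ID = 1, cleaned_review = "The mouse lights are great")
toy_stop   <- c("the", "are")

toy_bigrams <- toy_review |>
  # Bi-grams: "the mouse", "mouse lights", "lights are", "are great"
  unnest_tokens(bigram, cleaned_review, token = "ngrams", n = 2) |>
  separate(bigram, c("word1", "word2"), sep = " ") |>
  # Drops "the mouse" and "are great" (stop word in first position)
  filter(!word1 %in% toy_stop) |>
  # Drops "lights are" (stop word in second position)
  filter(!word2 %in% toy_stop) |>
  unite(bigram, word1, word2, sep = " ")

toy_bigrams
```

Only "mouse lights" is left, because it is the only bi-gram with no stop word in either position.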

Once we have this, we are able to count the number of times each bi-gram appears! We want to see specifically how bi-grams differ in positive reviews versus negative reviews.

NOTE: When we look at bi-grams, we can remove stop words even with TF-IDF. This means whether we are looking at frequency or high TF-IDF phrases, it’s okay to remove stop words here. There are too many unique combinations with bi-grams for IDF to down-weight them completely.

Question 16

Based on the client’s request, should we find (a) the most frequent bi-grams in positive and negative reviews or (b) the bi-grams with the highest TF-IDF in positive and negative reviews?

Question 17

In Question 16, you decided whether we needed to look at (a) the most frequent bi-grams in positive and negative reviews or (b) the bi-grams with the highest TF-IDF in positive and negative reviews.

Based on that decision, which of the following code options do you need to use?

Option 1

bigrams_count <- tidy_bigrams_noStop |>
  count(bigram,label, sort = TRUE)|>
  group_by(label) |>
  slice_max(n , n = 10 )

Option 2

bigrams_count <- tidy_bigrams_noStop |>
  count(bigram,label, sort = TRUE)|>
  bind_tf_idf(bigram, document = label, n)|>
  group_by(label) |>
  slice_max(tf_idf , n = 10 )
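To see how the two options treat the same counts differently, here is a small sketch of what bind_tf_idf adds. The bi-grams and counts in toy_counts are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical bi-gram counts by label
toy_counts <- data.frame(
  bigram = c("wireless mouse", "wireless mouse",
             "battery life", "stopped working"),
  label  = c("positive", "negative", "positive", "negative"),
  n      = c(10, 8, 4, 5)
)

# Adds tf, idf, and tf_idf columns, treating each label as one "document"
toy_tfidf <- toy_counts |>
  bind_tf_idf(bigram, label, n)

toy_tfidf
```

With only two "documents" (the two labels), a bi-gram that appears under both labels has an IDF of ln(2/2) = 0, so its TF-IDF is 0 no matter how often it occurs. A bi-gram that appears under only one label has an IDF of ln(2/1) and keeps a TF-IDF above 0.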

Question 18

Create a plot showing the top bi-grams in each sentiment based on your results in Question 17. A skeleton of the code you need is provided below.

ggplot(bigrams_count, aes(..., reorder_within(bigram , ..., label),fill = label)) + 
  geom_col(fill= "...") +
  labs(x = "...", y = "...") + 
  facet_wrap(~ label, ncol = 2, scales = "free_y") +
  scale_y_reordered()

Question 19

Using the plot you have created, explain to the client (1) what you would recommend they improve about their product and (2) what users like about their product. Explain how you reached these conclusions based on the plot.

Remember, you are writing to your client in this section, not to me!

The Code (for your data analysis)

We had a lot of code today, so here it is in one place.

Creating a Bing sentiment score

positive_withBing <- AmazonReviews |>

  unnest_tokens(word, cleaned_review) |>

  inner_join(get_sentiments("bing"), relationship = "many-to-many") |>

  filter(sentiment == "positive") |>
  
  count(ID) |>
  
  rename(n_positive = n)
  
  
negative_withBing <- AmazonReviews |>

  unnest_tokens(word, cleaned_review) |>

  inner_join(get_sentiments("bing"), relationship = "many-to-many") |>

  filter(sentiment == "negative") |>
  
  count(ID) |>
  
  rename(n_negative = n) 
  
AmazonReviews <- AmazonReviews |>
  full_join(positive_withBing, by = "ID") |>
  full_join(negative_withBing, by = "ID") |>
  mutate(n_positive = replace_na(n_positive,0)) |>
  mutate(n_negative = replace_na(n_negative,0)) |>
  mutate( bing_sentiment = n_positive - n_negative)

Creating an AFINN sentiment score

sentiment_withAFINN <- AmazonReviews |>

  unnest_tokens(word, cleaned_review) |>

  inner_join(get_sentiments("afinn"), relationship = "many-to-many") |>
  
  group_by(ID) |>

  summarize(sentiment = sum(value))
  
AmazonReviews <- AmazonReviews |>
  full_join(sentiment_withAFINN, by = "ID") |>
  mutate(sentiment = replace_na(sentiment,0))

References

Lexicons

AFINN: Finn Årup Nielsen, “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, CEUR Workshop Proceedings 718, pp. 93-98, May 2011. http://arxiv.org/abs/1103.2903.

bing: Minqing Hu and Bing Liu, “Mining and summarizing customer reviews”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.

Code

The code was adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.