STA 279 Lab 4

Complete all Questions.

The Goal

We have been learning about the foundations of sentiment analysis. The goal for today is to apply what we have learned and to dig a little deeper into what we can do with sentiment analysis.

The Data Set

We are provided with n = 2499 Amazon reviews from a company that sells products on Amazon. All of these reviews are for a wireless mouse with LED lights that let the user change the color of the mouse as desired. To load the data, use the following:

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)
# Load the data
AmazonReviews <- read.csv("https://www.dropbox.com/scl/fi/61kcmzazw531qcwv28j52/AmazonReviews.csv?rlkey=krdesfg9luff2h5ajqph6q1bf&st=6ugb1p2f&dl=1")

# Remove a row that is blank (has no review)
AmazonReviews <- data.frame(AmazonReviews[-2086,-c(1,3)])

# Quick Data Cleaning - add ID 
AmazonReviews$ID <- 1:nrow(AmazonReviews)

# The cleaned reviews did not translate contractions well.
# We can fix that manually using this code. The \\b marks a word
# boundary, so we only replace the whole words "won" and "haven"
# (and not, say, the "won" inside "wonder").
AmazonReviews$cleaned_review <- gsub("\\bwon\\b", "won't", AmazonReviews$cleaned_review)
AmazonReviews$cleaned_review <- gsub("\\bhaven\\b", "haven't", AmazonReviews$cleaned_review)

The data set has \(n=2499\) rows and 3 columns.

  • cleaned_review: This is the written review.
  • review_score: This is how many stars the review received on Amazon.
  • ID: an ID number for the review.
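
If you would like to confirm this for yourself, a quick check works:

# Sanity check: dimensions and column names of the data
dim(AmazonReviews)
names(AmazonReviews)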

The company serving as our client for today has two goals. They want to use the reviews of their wireless mouse to (1) decide whether any changes should be made to the product and (2) identify particular things customers like about the product, so the company can make sure to keep those features going forward.

The client would like to do all of this without having to read all 2499 reviews…which is understandable.

To do this, the client would like us to sort the reviews into negative and positive reviews. They would then like us to analyze these two review types and provide them insight about the product.

Lexicons: Bing

If we look at our data, we are not told whether a review is positive or negative. This means that we somehow have to create this label ourselves. But how??

One way to do this is by using sentiment analysis. In other words, we can look at the reviews and pick out negative words or positive words in the reviews, and we can use this to determine if a review is positive or negative. To find words that are positive or negative, we use lexicons.

Lexicons are just data sets that contain words pertaining to specific emotions. There are many lexicons to choose from, but for today, let’s start by exploring the Bing lexicon. The Bing lexicon has a list of words and has tagged each of them as either “positive” or “negative”. We load the Bing lexicon by using the code below.

# Load the lexicon
binglexicon <- get_sentiments("bing")

If you look at binglexicon, you will notice it has two columns. The first column, word, holds the words in the Bing lexicon. The second column, sentiment, states which sentiment is associated with each word. There are only two options for sentiment in this lexicon: positive or negative.

head(binglexicon)

Question 1

How many negative words are in the Bing lexicon?

Okay, great. So far, we have a data set containing some positive words and negative words. What can we do with this?

Remember that our goal is to determine whether a review is positive or negative. This means we want to look at a review and find all the positive and negative words in it. We identify these words using the Bing lexicon. Let’s see how.

Consider the tokenized form of the first Amazon Review:

AmazonReviews[1,] |>
    unnest_tokens(word, cleaned_review)
##    review_score ID    word
## 1             5  1       i
## 2             5  1    wish
## 3             5  1   would
## 4             5  1    have
## 5             5  1  gotten
## 6             5  1     one
## 7             5  1 earlier
## 8             5  1    love
## 9             5  1      it
## 10            5  1     and
## 11            5  1      it
## 12            5  1   makes
## 13            5  1 working
## 14            5  1      in
## 15            5  1      my
## 16            5  1  laptop
## 17            5  1      so
## 18            5  1    much
## 19            5  1  easier

We want to find all the words in this review that are also in the Bing lexicon (in other words, we want to find the positive and negative words in the review). To do this, we just need to add one line of code:

AmazonReviews[1,] |>
    unnest_tokens(word, cleaned_review) |> 
    inner_join(binglexicon)
##   review_score ID   word sentiment
## 1            5  1   love  positive
## 2            5  1 easier  positive

NOTE: You’ll notice we get a message we’ve seen before: Joining with by = join_by(word). This is just R letting us know that we didn’t tell it specifically which column to join on. To fix this, we’ll add by = join_by(word) to the code. This is not necessary for the code to run; it just removes the message.

AmazonReviews[1,] |>
    unnest_tokens(word, cleaned_review) |> 
    inner_join(binglexicon, by = join_by(word))
##   review_score ID   word sentiment
## 1            5  1   love  positive
## 2            5  1 easier  positive

This code looks at all the words in the first review and determines that only “love” and “easier” are in the Bing lexicon. Because of this, all the other words have been removed! This is what inner_join does: it keeps only the rows that appear in both data sets, which here are the first review and the Bing lexicon.

We also get a bonus when we do this. The function inner_join attaches the information from the Bing lexicon to the two words we kept, which in this case is whether each word is positive or negative.
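
To see this behavior on its own, here is a tiny made-up example (the words and mini lexicon below are invented for illustration; they are not the real Bing lexicon):

# Only words in BOTH data frames are kept, and the sentiment
# column from the lexicon comes along for the ride
review_words <- data.frame(word = c("i", "love", "this", "easier"))
mini_lexicon <- data.frame(word = c("love", "easier", "awful"),
                           sentiment = c("positive", "positive", "negative"))
inner_join(review_words, mini_lexicon, by = join_by(word))
##     word sentiment
## 1   love  positive
## 2 easier  positive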

Question 2

Based on what we have so far, would you classify the first review as positive or negative?

Question 3

  1. How many words from the 2nd Amazon Review (so the second row in the AmazonReviews data set) are also in the Bing lexicon?

  2. Would you classify the second review as positive or negative?

So far we have been doing this one review at a time. This will be tedious if we have a lot of reviews. If we want to keep only the words from ALL the reviews that are in Bing, we can use:

tidy_withBing <- AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  inner_join(binglexicon, by = join_by(word))

NOTE: You will notice that when you run commands like inner_join or other merge commands on larger data sets, you get a warning printed on your screen:

Warning: Detected an unexpected many-to-many relationship between x and y.

It looks intimidating because it’s red, but you can ignore it! It just means that some words in Bing appear in multiple reviews, which is not a problem. If you want to hide this output, you can change your chunk header to {r, message = FALSE, warning = FALSE}. If you’d like to make this change for all chunks at once, let Dr. Dalzell know! She can show you how to do it in one place so you don’t have to edit each chunk individually. As a hint, this will be very helpful for Data Analysis 1!
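
In the meantime, if you are curious, one common way to make the change globally (assuming you are knitting an R Markdown file) is to put this in the setup chunk at the top of the document:

# Silence messages and warnings for every chunk in this document
knitr::opts_chunk$set(message = FALSE, warning = FALSE)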

Classifying Using Sentiment Analysis

At the moment, we have a data set that contains all the positive and negative words across the 2499 Amazon Reviews. We are going to use this to sort the 2499 reviews into two piles: negative reviews and positive reviews. In other words, we are going to use sentiment analysis to classify, or label, each review as positive or negative.

To sort reviews into the categories positive and negative, we need to create a sentiment score for each review. This is a number that says how positive or negative the review is. Typically, positive scores relate to positive reviews and negative scores relate to negative reviews.

Using the Bing lexicon, we can find and count the number of positive and negative words in each review. This means that to create a sentiment score, we might consider subtracting the number of negative words from the number of positive words:

\[\text{Sentiment Score} = \text{Number of Positive Words} - \text{Number of Negative Words}\]

This means that we need to count the number of positive words and the number of negative words in each review. For example, a review with 3 positive words and 1 negative word would get a sentiment score of 3 - 1 = 2.

Question 4

Consider these two count commands:

# Count Command 1 
positive_counts <- AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  inner_join(binglexicon, by = join_by(word)) |>
  filter(sentiment == "positive") |>
  count(ID)
# Count Command 2
positive_counts <- AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  inner_join(binglexicon, by = join_by(word)) |>
  filter(sentiment == "positive") |>
  count(word, ID)

The only difference between the two commands is whether or not we have word inside the parentheses of count().

Our goal is to count how many positive words appear in each Amazon review. Which of these two options (Command 1 or Command 2) do we need to use?

Go ahead and run whichever command you chose in the previous question. This creates a data set called positive_counts, which has two columns: ID and n. The column n counts the number of positive words in each review. This is fine…but we are about to count the number of negative words in each review as well, and when we do, we will again have the two columns ID and n. That can get confusing.

Because of this, we are going to rename our column n to n_positive_words. To do this, add this line of code (and a pipe!!) to the end of your code from Question 4.

 rename( n_positive_words = n )
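
If you have already created positive_counts without the rename, you can also rename it after the fact; this sketch does the same thing:

# Equivalent: rename the n column after creating positive_counts
positive_counts <- positive_counts |>
  rename(n_positive_words = n)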

Question 5

Adapt the code from Question 4 to count the number of negative words in each review. Rename the column n to be n_negative_words. As an answer to this question, show the first few rows of negative_counts using:

head(negative_counts)

At this point we have two data sets. One counts the number of positive words in each review, and one counts the number of negative words in each review. We now want to put them together in the same data set so we can subtract and get the final scores!

To do this, we switch from inner_join to full_join. The function inner_join keeps only the rows that appear in both data sets. The function full_join keeps all the rows from both data sets and adds on the columns from each.
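
Here is a tiny made-up illustration of the difference (the IDs and counts are invented):

# full_join keeps every ID, filling in NA when an ID has no
# matching row in the counts data set
reviews <- data.frame(ID = c(1, 2, 3))
pos <- data.frame(ID = c(1, 3), n_positive_words = c(2, 1))
full_join(reviews, pos, by = "ID")
##   ID n_positive_words
## 1  1                2
## 2  2               NA
## 3  3                1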

To add the positive_counts data set to our original data (AmazonReviews), we can use:

AmazonReviews_Test <- AmazonReviews |>
  full_join(positive_counts, by = "ID")

Question 6

If you look at the 4th row in AmazonReviews_Test, you will notice that the last column has an NA in it. Why?

To fill in those NAs, we add one more line of code:

AmazonReviews <- AmazonReviews |>
  full_join(positive_counts, by = "ID") |>
  mutate(n_positive_words = replace_na(n_positive_words,0))

The function mutate basically means “change”. Here we are changing the column n_positive_words by replacing the NAs (replace_na) with 0s.
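
The function replace_na comes from the tidyr package; you can try it on a plain vector to see exactly what it does:

# replace_na swaps every NA for the value you supply
replace_na(c(4, NA, 1, NA), 0)
## [1] 4 0 1 0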

Question 7

How would you change the code above if you wanted to replace all the NAs with the number 6?

Note: Don’t run this code!! We don’t want all our NAs to be 6.

Question 8

After running the code above Question 7, adapt that code to add on the number of negative words in each review. As an answer to this question, state the number of positive words and the number of negative words in the first review.

Question 9

What should the sentiment score be for the first review?

Hint: \[\text{Sentiment Score} = \text{Number of Positive Words} - \text{Number of Negative Words}\]

At this point, we have everything we need to create a sentiment score! There are a lot of ways to code this, but I like a simple one:

AmazonReviews <- AmazonReviews |>
  mutate( sentimentscore = n_positive_words - n_negative_words)

Question 10

Create an appropriate graph to show the distribution of sentiment score.

Classifying Based on Sentiment Score

Now that we have our sentiment score, the client has asked us to label (classify) each review as positive, negative, or neutral.

Question 11

Which ranges of sentiment scores do you think we should use to classify into positive, negative, and neutral? In other words, what sentiment scores should be related to a positive label, a negative label, and a neutral label?

Once we have decided on a range, we can use R to help create a new variable in our data set that indicates the sentiment of each review. For instance,

# Create a New Column
AmazonReviews$sentiment <- NA

# Positive
AmazonReviews$sentiment <- ifelse(AmazonReviews$sentimentscore > 0, "positive", AmazonReviews$sentiment)

The code above creates a new column in the AmazonReviews data set called sentiment, filled with NAs. The second line then says: for all rows with a sentiment score greater than 0, label the review as positive; all other rows are left unchanged (still NA for now).
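
If ifelse() is new to you, it works element-wise on a vector. Here is a made-up example:

# For each element: if the condition is TRUE, use the second
# argument; otherwise, fall back to the third
scores <- c(3, -2, 0)
ifelse(scores > 0, "positive", NA)
## [1] "positive" NA         NA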

Question 12

Add two more lines of code to finish the sentiment labeling with negative and neutral. As the answer to this question, run the code below to show how many positive, negative, and neutral reviews you end up with!

table(AmazonReviews$sentiment)

And there we have it! We have used sentiment analysis to classify each of the 2499 reviews as positive, negative, or neutral, without having to read them all!! This is extremely useful in practice!!

But wait a second….

Question 13

On Amazon, reviewers leave 1 to 5 stars to indicate how much they liked the product. In theory, 5-star reviews should come from the people who liked the product the most, and 1-star reviews from the people who liked it the least.

The client wants to know if the star ratings are enough to determine sentiment. In other words, could they just label 4-5 star reviews as positive, 3 star reviews as neutral, and 1-2 star reviews as negative?

To find out, create an appropriate plot to compare our sentiment scores (numeric) with the star reviews (treat this as categorical).

Question 14

  1. Based on your plot in Question 13, does it look like the sentiment of a review is reflected in the number of stars? Explain.

  2. Why do you think this might be true? (There is no right answer to this, just brainstorm)

Using the Classifications

Now that we have classified reviews into positive or negative, our client wants us to find the bi-grams that most distinguish positive reviews and negative reviews. They are hoping this will give them information on (1) how they can improve their product or (2) what users like about their products.

Question 15

Based on the client’s request, should we find (a) the most frequent bi-grams in positive and negative reviews or (b) the bi-grams with the highest TF-IDF in positive and negative reviews?

To tokenize the reviews into bi-grams, we use code very similar to the code for tokenizing into single words:

tidy_bigrams <- AmazonReviews |>
  unnest_tokens(bigram, cleaned_review, token = "ngrams", n = 2) 

Removing stop words is a little more complicated for bi-grams than it is for single words, however. A bi-gram might have a stop word at the beginning (“and remember”) or at the end (“giving it”). This means that when we remove stop words from bi-grams, we need to consider both scenarios and remove any bi-gram with a stop word at either the beginning or the end.

We can do this using the code below:

tidy_bigrams_noStop <- AmazonReviews |>
  unnest_tokens(bigram, cleaned_review, token = "ngrams", n = 2) |> 
  separate(bigram, c("word1", "word2"), sep = " ") |> 
  # Remove any rows where word 1 is a stop word
  filter(!word1 %in% stop_words$word) |>
  # Remove any rows where word 2 is a stop word
  filter(!word2 %in% stop_words$word) |> 
  unite(bigram, word1, word2, sep = " ")
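
If the separate() and unite() steps feel opaque, here is the same idea on a few made-up bi-grams (only “wireless mouse” survives, because “the” and “it” are stop words):

# Split each bi-gram in half, drop rows where either half is a
# stop word, then glue the survivors back together
toy <- data.frame(bigram = c("the mouse", "wireless mouse", "love it"))
toy |>
  separate(bigram, c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  unite(bigram, word1, word2, sep = " ")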

Once we have this, we are able to count the number of times each bi-gram appears! We want to see specifically how bi-grams differ in positive reviews versus negative reviews.

NOTE: With bi-grams, it is okay to remove stop words even when we plan to use TF-IDF. With single words, IDF handles stop words for us, because words like “the” appear in essentially every document and get down-weighted. A stop word can pair with too many different words, though, so stop-word bi-grams are not shared widely enough across documents for IDF to down-weight them completely. This means that whether we are looking at frequency or at high TF-IDF phrases, it’s okay to remove stop words here.

Question 16

In Question 15, you decided whether we needed to look at (a) the most frequent bi-grams in positive and negative reviews or (b) the bi-grams with the highest TF-IDF in positive and negative reviews.

Based on that decision, which of the following codes do you need to use?

Option 1

bigrams_count <- tidy_bigrams_noStop |>
  count(bigram, sentiment, sort = TRUE) |>
  group_by(sentiment) |>
  slice_max(n, n = 10)

Option 2

bigrams_count <- tidy_bigrams_noStop |>
  count(bigram, sentiment, sort = TRUE) |>
  bind_tf_idf(bigram, document = sentiment, n) |>
  group_by(sentiment) |>
  slice_max(tf_idf, n = 10)

Question 17

Create a plot showing the top bi-grams in each sentiment based on your results in Question 16. A skeleton of the code you need is provided below. The functions reorder_within() and scale_y_reordered() come from the tidytext package; they keep the bars sorted within each facet.

ggplot(bigrams_count, aes(..., reorder_within(bigram, ..., sentiment), fill = sentiment)) + 
  geom_col(fill = "...") +
  labs(x = "...", y = "...") + 
  facet_wrap(~sentiment, ncol = 2, scales = "free_y") +
  scale_y_reordered() 

Question 18

Using the plot you have created, explain to the client (1) what you would recommend they improve about their product and (2) what users like about their product. Explain how you reached these conclusions based on the plot.

Remember, you are writing to your client in this section, not to me!

References

Lexicons

AFINN: Finn Årup Nielsen, “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, CEUR Workshop Proceedings 718, 93-98, May 2011. http://arxiv.org/abs/1103.2903.

bing: Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.

Code

The code was adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.