STA 279 Lab 4

Complete all Questions.

The Goal

We have been learning about the foundations of sentiment analysis. The goal for today is to apply what we have learned and to dig a little deeper into what we can do with sentiment analysis.

The Data Set

We are provided with n = 2499 Amazon reviews from a company that sells products on Amazon. All of these reviews are for a wireless mouse that has LED lights that allow the user to change the color of the mouse as desired. To load the data, use the following:

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(forcats)

# Load the data
AmazonReviews <- read.csv("https://www.dropbox.com/scl/fi/61kcmzazw531qcwv28j52/AmazonReviews.csv?rlkey=krdesfg9luff2h5ajqph6q1bf&st=6ugb1p2f&dl=1")
AmazonReviews <- data.frame(AmazonReviews[-2086,-c(1,3)])
AmazonReviews$ID <- 1:nrow(AmazonReviews)

The data set has \(n=2499\) rows and 3 columns.

cleaned_review: This is the written review.
review_score: This is how many stars the review received on Amazon.
ID: an ID number for the review.

Lexicons: Bing

The company serving as our client for today wants to use the reviews on their wireless mouse to decide if any changes should be made in their product, or if there are particular things customers like about the product and therefore the company should make sure to keep these features going forward.

One way to approach this question is by using sentiment analysis. In other words, we can look at the reviews and pick out negative words or positive words that occur fairly often in the reviews. These can tell us things about the mouse that customers particularly like or dislike!

When we perform sentiment analysis, we generally look at the words in the text and determine what emotion each word conveys. Is it a positive word? A negative word? A word about surprise? To determine what emotion a word expresses, we use lexicons.

Lexicons are just data sets that contain words pertaining to specific emotions. There are many lexicons to choose from, but for today, let’s start by exploring the Bing lexicon. The Bing lexicon has a list of words and has tagged each of them as either “positive” or “negative”. We can see this if we run the code below:

# Load the lexicon
binglexicon <- get_sentiments("bing")

The command get_sentiments("bing") essentially tells R to access the data set, but on its own this command just prints out the lexicon. So, we use binglexicon <- to store the lexicon as a data set named binglexicon.

head(binglexicon)

If you look at binglexicon, you will notice it has two columns. The first column, word, holds the words in the Bing lexicon. The second column, sentiment, states which sentiment is associated with each word. There are only two options for sentiment in this lexicon: positive or negative.

Question 1

How many negative words are in the Bing lexicon?

Okay, great. So far, we have a data set containing some positive words and negative words. What we would like to do is see how many of these positive and negative words actually occur in our Amazon reviews.

The first step in this process is to tokenize our Amazon reviews.

tidy_Amazon <- AmazonReviews |>
  # Tokenize the reviews 
  unnest_tokens(word, cleaned_review)

What we now want to do is look at the word column in tidy_Amazon, which holds all the words in the Amazon Reviews. We want to check to see if each of these words is also listed in the Bing lexicon, meaning they are positive or negative (according to our lexicon). If the word is in Bing, we want to (1) keep the word and (2) label it as positive or negative. If the word is not in Bing, we want (3) to remove the word from tidy_Amazon, because it won’t help us on our quest to find positive and negative words.

This sounds tedious…but luckily we can do it all using one function called inner_join. For example, consider the tokenized form of the first Amazon Review, tidy_Amazon1:

##    review_score ID    word
## 1             5  1       i
## 2             5  1    wish
## 3             5  1   would
## 4             5  1    have
## 5             5  1  gotten
## 6             5  1     one
## 7             5  1 earlier
## 8             5  1    love
## 9             5  1      it
## 10            5  1     and
## 11            5  1      it
## 12            5  1   makes
## 13            5  1 working
## 14            5  1      in
## 15            5  1      my
## 16            5  1  laptop
## 17            5  1      so
## 18            5  1    much
## 19            5  1  easier

We want to find all the words in this review that are also in Bing. To do this, we run the following:

inner_join( tidy_Amazon1, binglexicon)

## Joining with `by = join_by(word)`

##   review_score ID   word sentiment
## 1            5  1   love  positive
## 2            5  1 easier  positive

You’ll notice we get a message we’ve seen before (this is just R telling us which column it used for the join, because we didn’t specify one). To remove it, we’ll add by = join_by(word) to the code. This is not necessary for the code to run; it only hides the message.

inner_join( tidy_Amazon1, binglexicon, by = join_by(word))

##   review_score ID   word sentiment
## 1            5  1   love  positive
## 2            5  1 easier  positive

So, this code looks at all the rows in the first review and determines that only the words “love” and “easier” are in the Bing lexicon. Because of this, all the other words in tidy_Amazon1 have been removed.

We also get a bonus. The function inner_join attaches any information on these two words from the Bing lexicon to the two words we kept, which in this case is specifically whether the words are positive or negative.

So, the function inner_join looks at the word column in both the tokenized text data and in the lexicon (binglexicon). If a word appears in both data sets, we keep the word, and we add onto our tokenized review any columns of information about the word we can get from binglexicon.
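To see this behavior on a very small scale, here is a sketch with a hand-made lexicon. The names tokens and mini_lexicon, and the three words in the lexicon, are made up for illustration; this is not the Bing lexicon itself.

```r
library(dplyr)

# A few tokens from one (hypothetical) review
tokens <- data.frame(
  ID   = c(1, 1, 1, 1),
  word = c("love", "it", "so", "easier")
)

# A tiny hand-made lexicon standing in for binglexicon
mini_lexicon <- data.frame(
  word      = c("love", "easier", "broken"),
  sentiment = c("positive", "positive", "negative")
)

# Rows whose word is in both tables are kept,
# and the sentiment column is attached to them
joined <- inner_join(tokens, mini_lexicon, by = join_by(word))
joined
```

The words “it” and “so” are dropped because they are not in the lexicon, and “broken” never shows up because it is not in the review.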

Question 2

How many words from the 2nd Amazon Review (so the second row in the AmazonReviews data set) are also in the Bing lexicon?

So far we have been doing this one review at a time. If we want to keep only the words from all the reviews that are in Bing, we can use:

tidy_bing <- AmazonReviews |>
  # Tokenize the reviews 
  unnest_tokens(word, cleaned_review) |>
  # Keep only words in the reviews that 
  # are also in the Bing lexicon
  # and label them positive or negative
  inner_join(get_sentiments("bing"), by = join_by(word))

NOTE: When you run inner_join or other merge commands with larger data sets, you may see a warning on your screen:

Warning: Detected an unexpected many-to-many relationship between x and y.

It looks intimidating because it’s red, but you can ignore it! It just means that the same word can appear many times across the reviews, and that a few words are listed more than once in the Bing lexicon (once as positive and once as negative). This is not a problem! If you want to hide this output, you can change your chunk header to be {r, message = FALSE, warning = FALSE}. If you’d like to make this change for all chunks, let Dr. Dalzell know! She can show you how to do this all at once so you don’t have to do it for each chunk individually. As a hint, this will be very helpful for Data Analysis 1!
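If you are curious where the warning comes from, it appears when the same word is repeated in both of the tables being joined. The sketch below reproduces that situation with made-up tables (tokens and mini_lexicon are hypothetical names) and assumes dplyr version 1.1.0 or later, where inner_join has a relationship argument.

```r
library(dplyr)

# Hypothetical tokens: the same word shows up in two different reviews
tokens <- data.frame(ID = c(1, 2), word = c("envious", "envious"))

# Hypothetical lexicon in which one word carries both labels
mini_lexicon <- data.frame(
  word      = c("envious", "envious"),
  sentiment = c("positive", "negative")
)

# Declaring the relationship tells dplyr the duplication is expected,
# so no warning is raised
joined <- inner_join(tokens, mini_lexicon,
                     by = join_by(word),
                     relationship = "many-to-many")
nrow(joined)
```

Each of the two tokens is matched to both lexicon rows, which is why the result has more rows than either input.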

Positive Words

At the moment, we have a data set that contains all the positive and negative words across the 2499 Amazon Reviews. This doesn’t really help our client though - there are over 8000 such words!! Instead, our client asks us which positive words happen most frequently across the data set, with a goal of determining what customers like and don’t like about the wireless mouse.

Question 3

Consider these two count commands:

# Count Command 1 
positive_counts <- AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  inner_join(get_sentiments("bing"), by = join_by(word)) |>
  filter(sentiment=="positive") |>
  count()

# Count Command 2
positive_counts <- AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  inner_join(get_sentiments("bing"), by = join_by(word)) |>
  filter(sentiment=="positive") |>
  count(word)

The only difference is whether or not we have word inside the parentheses.

(a) Our goal is to count how many times each positive word appears in the Amazon reviews. Which of these two options (Command 1 or Command 2) do we want?
(b) Run the line of code you chose in (a) and state which positive word occurs most often in the reviews. Hint: Remember that if you want the most frequently occurring word to appear on the top of your result, you need to add sort = TRUE to your count command. This means either count(sort = TRUE) or count(word, sort = TRUE), depending on which option you chose.

Question 4

Create a plot showing the top 10 positive words that occur in the Amazon Reviews. Put the word count on the x axis and the word on the y axis. A skeleton code is provided below.

ggplot( .... , aes( x = ..., y = fct_reorder( word, ...))) +
    geom_col( fill = ...)
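As a reminder of what fct_reorder does in the skeleton, here is a small sketch with made-up words and counts (word and n_times are hypothetical names, not columns from the lab data). It reorders the levels of the first argument by the values of the second.

```r
library(forcats)

word    <- c("easy", "great", "nice")  # made-up words
n_times <- c(3, 10, 5)                 # made-up counts

# Reorder the levels of `word` from the smallest count to the largest
ordered_word <- fct_reorder(word, n_times)
levels(ordered_word)
```

Because ggplot draws the first level at the bottom of the y axis, ordering from smallest to largest puts the most frequent word at the top of the plot.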

If you look at the plot in Question 4, you will notice two issues.

Issue 1: The word “works” appears in a few forms: “work”, “works”, and “worked”. All of this tells our client the same thing…the mouse works.

Issue 2: Some of the words, while positive, don’t really help our client. Our client wants to know what features of the mouse the customers like. This means that the words “like” and “love”, for instance, are not helpful.

Before we proceed with our analysis, let’s handle these two issues.

Issue 1: All the work words

With multiple words telling the client that the mouse works, we can actually combine all of these words into one word: “works”. There are a lot of ways to do this, but one of these is to use a str_replace_all command. The structure of this code is:

str_replace_all( in this text, "replace this word", "with this word")

In other words, this command finds a certain word and replaces every occurrence of that word with another word you choose. For instance, consider the phrase: “working is a lot of work!”. If we want to replace the word “working” with “work”, we run:

sentence <- "working is a lot of work!"
str_replace_all(sentence, "working", "work")

## [1] "work is a lot of work!"

You will notice that the word “working” is now “work”!

This means that to replace all “worked” and “work” in our Amazon Reviews with “works”, we can use this code! The only slight difference is that because our text is stored in a column, we need to also use mutate. The command mutate lets us add or change a column in a data set.

mutate( columnName = str_replace_all( columnName, "replace this word", "with this word"))

The code above tells R to look in the columnName column in a data set. For every line of text in that column, R will replace the word we specify with a new word.

tidy_AmazonReviews <- AmazonReviews |>
  # Tokenize
  unnest_tokens(word, cleaned_review) |>
  
  # "worked" becomes "works"
  # (^ and $ match the whole word, so "works" and "working" are left alone)
  mutate(word = str_replace_all(word, "^worked$", "works")) |>
  
  # "work" becomes "works"
  mutate(word = str_replace_all(word, "^work$", "works"))
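One thing to watch for when using str_replace_all on tokenized words: the pattern is matched anywhere inside a string, so the pattern "work" is also found inside “works” and “working”. The sketch below, using a made-up vector named words, compares a plain pattern with one that uses the regular expression anchors ^ (start of the string) and $ (end of the string).

```r
library(stringr)

words <- c("work", "works", "worked", "working")

# Plain pattern: "work" is replaced wherever it occurs inside each string
unanchored <- str_replace_all(words, "work", "works")
unanchored
# "works" "workss" "worksed" "worksing"

# Anchored pattern: only strings that are exactly "work" or "worked" change
anchored <- str_replace_all(words, "^(work|worked)$", "works")
anchored
# "works" "works" "works" "working"
```

Misspelled results such as “workss” would no longer match anything in the Bing lexicon, so they would silently disappear after the inner_join.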

Question 5

Adapt your code from Question 4 to show the top 10 positive words after combining “work”, “worked”, and “works”. In other words, create a plot showing the top 10 positive words that occur in the Amazon Reviews now that “work”, “worked”, and “works” are all combined.

Issue 2: Like and Love

With the first issue handled, let’s move onto issue 2. Some of the words, while positive, don’t really help our client. Our client wants to know what features of the mouse the customers like. This means that the word “like” and “love”, for instance, are not helpful.

In this type of situation, we can choose to exclude the words “like” and “love” from consideration. They don’t help our client, so let’s not worry about them. We can remove rows with certain words using filter. Specifically, if we want to remove the word “like”, we can use filter( word != "like"). The != part of the code means not equal to. So, we keep all words that are not the word “like”!
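Here is a quick sketch of what != does inside filter, using a made-up data set named toy_words (hypothetical, not from the lab data).

```r
library(dplyr)

toy_words <- data.frame(word = c("like", "comfortable", "like", "bright"))

# Keep every row whose word is NOT "like"
kept <- toy_words |>
  filter(word != "like")
kept$word
```

Both rows containing “like” are removed, and the other rows are left untouched.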

AmazonReviews |>
  unnest_tokens(word, cleaned_review) |>
  
  # (NEW!!!!) Remove the word like 
  filter(word != "like") |>
  
  # "worked" becomes "works" (^ and $ match the whole word only)
  mutate(word = str_replace_all(word, "^worked$", "works")) |>
  
  # "work" becomes "works"
  mutate(word = str_replace_all(word, "^work$", "works")) |>
  
  # Keep only the positive words
  inner_join(get_sentiments("bing"), by = join_by(word)) |>
  filter( sentiment == "positive") |>

  # Count the number of times each word appears
  count(word) |>
  
  # Show only the top 10
  slice_max(n , n = 10)

Question 6

Adapt the code above to also remove the word “love” from the data set. Show the top 10 words after removing “like” and “love”.

I know this is getting to be a long function!! This is because there are a lot of steps involved in getting the text ready to work with. My goal is that by building the code here, you will be able to reference this if you run into similar issues when working with your Data Analysis data.

Question 7

Based on the words you see in the top 10 list, are there any other words you think need to be removed in order to provide the most insight to our client? If so, state which words, and show the top 10 after removing them all.

Note: There is no single correct answer to this!! Just try out what you think is reasonable.

Question 8

Based on the final top 10 positive words you chose in Question 7, what do customers like about the mouse?

Question 9

Repeat the same process to show the top 10 negative words in the Amazon Reviews. Does anything like Issue 1 or Issue 2 show up? Explain if so, and show the top 10 negative words after handling these issues (if any show up).

Question 10

Based on your answer to Question 9, what do the customers not like about the mouse?

We will likely notice with some of these that we would like to know what was bad or what broke. We will get into that with bi-grams soon!!

Other Uses of Sentiment Analysis

So far, we have determined the key words in positive or negative reviews. This is really helpful for the client. They now ask us one more question.

On Amazon, reviewers leave 1 to 5 stars to indicate how much they liked the product. In theory, 5 star reviews should be the people who liked the product the most and 1 star reviews are from people who liked the products the least. The client wants to know if 5 star reviews actually have more positive text than other star reviews.

With the Bing lexicon, we can only count the number of positive and negative words. This means that if we want some way to measure how positive a review is overall, we might consider subtracting the number of negative words from the number of positive words:

\[Score = \text{Number of Positive Words} - \text{Number of Negative Words}\]
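As a sketch of how this score could be computed, suppose the words have already been labeled positive or negative. The data set tagged below is made up for illustration, and this is one possible approach rather than the method used later in this lab.

```r
library(dplyr)
library(tidyr)

# Made-up labeled words: review 1 has 2 positive and 1 negative word,
# review 2 has 2 negative words and no positive words
tagged <- data.frame(
  ID        = c(1, 1, 1, 2, 2),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

scores <- tagged |>
  # Count positive and negative words within each review
  count(ID, sentiment) |>
  # One column per sentiment; a missing count is filled in as 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  # Score = positive words minus negative words
  mutate(score = positive - negative)
scores
```

Review 1 gets a score of 2 - 1 = 1 and review 2 gets 0 - 2 = -2.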

This is an option, but in this case the words “good” and “amazing” are both given the same weight in creating a score: each counts as one positive word. If we want to consider how positive or negative words are, we need to switch to a lexicon that gives us a score for how positive or negative each word is. The AFINN lexicon helps us do this.

Like with the Bing Lexicon, our first step is to use inner_join to determine which words in the Amazon Reviews are in the AFINN lexicon. Recall that our first Amazon Review is

##    review_score ID    word
## 1             5  1       i
## 2             5  1    wish
## 3             5  1   would
## 4             5  1    have
## 5             5  1  gotten
## 6             5  1     one
## 7             5  1 earlier
## 8             5  1    love
## 9             5  1      it
## 10            5  1     and
## 11            5  1      it
## 12            5  1   makes
## 13            5  1 working
## 14            5  1      in
## 15            5  1      my
## 16            5  1  laptop
## 17            5  1      so
## 18            5  1    much
## 19            5  1  easier

When we inner_join with AFINN, we get:

AFINN <- get_sentiments("afinn")

inner_join( tidy_Amazon1, AFINN)

##   review_score ID word value
## 1            5  1 wish     1
## 2            5  1 love     3

Instead of getting a label for whether a word is positive or negative like we did with Bing, we get a numeric value between -5 and 5. Positive values are associated with positive sentiment and negative values are associated with negative sentiment. To get a total sentiment score for a review, we can then just add up the values assigned to each of the words in the review that are also in AFINN.
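Here is a sketch of adding up values by review. The table mini_values is a hand-made stand-in whose words and values are hypothetical (they are not taken from AFINN), and tokens is a made-up set of words from two reviews.

```r
library(dplyr)

# Made-up tokens from two reviews
tokens <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  word = c("great", "but", "slow", "it", "broke")
)

# Hypothetical word values, NOT the actual AFINN values
mini_values <- data.frame(
  word  = c("great", "slow", "broke"),
  value = c(3, -2, -1)
)

totals <- tokens |>
  # Keep only the words that have a value
  inner_join(mini_values, by = join_by(word)) |>
  # Add up the values within each review
  group_by(ID) |>
  summarise(sentiment = sum(value))
totals
```

Review 1 totals 3 + (-2) = 1 and review 2 totals -1.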

Question 11

What total sentiment score would we give the first review?

We could do this for each review this way, but I’ve built a function that can create an AFINN score for every review in the data set all at once. To get the function, copy and paste the following into a chunk and press play. DO NOT CHANGE ANYTHING ABOUT THIS CODE!!!

get_afinnscore <- function( data, textcolumn, option){
 
  if(option == "Sum"){
    afinn <- data |>
      unnest_tokens(word, textcolumn) |> 
      inner_join(get_sentiments("afinn")) |>
      group_by(ID) |>
      summarise(sentiment = sum(value))
  
    dataout <- data |>
     full_join(afinn, by = "ID")|>
      mutate(sentiment = replace_na(sentiment,0))
  }
  
  if(option == "Mean"){
    afinn2 <- data |>
      unnest_tokens(word, textcolumn) |> 
      inner_join(get_sentiments("afinn")) |>
      group_by(ID) |>
      summarise(AvgSentiment = mean(value))

    dataout <- data |>
      full_join(afinn2, by = "ID")|>
      mutate(AvgSentiment = replace_na(AvgSentiment,0))
  }
  
  dataout 
}

Once you have hit play on the chunk above, to get a total sentiment score by adding up all the AFINN scores, we use:

AmazonReviews <- get_afinnscore(AmazonReviews, "cleaned_review", "Sum")

This adds a new column called sentiment to our AmazonReviews data set that gives the total sentiment score for each review.

Question 12

Create a plot showing how our sentiment score relates to the number of stars given on the Amazon reviews. Based on what you see, does it look like the number of stars is strongly related to how positive or negative the reviews are?

Your client now asks if maybe our results are just due to the fact that some reviews are longer than others. They ask us to get an average sentiment score rather than a total sentiment score. This means that instead of just adding up the sentiment score of all the words in each review, we also divide by the total number of positive and negative words in the review.

We can do this by making one small change to our code from the previous question. To get an average sentiment score for each review, we use:

AmazonReviews <- get_afinnscore(AmazonReviews, "cleaned_review", "Mean")

This adds a new column called AvgSentiment to our AmazonReviews data set that gives the average sentiment score for each review.
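To see why the total and the average can tell different stories, consider this sketch with made-up word values (scored is a hypothetical data set): review 1 is long and mildly positive, while review 2 has a single strongly positive word.

```r
library(dplyr)

# Made-up values for the scored words in two reviews
scored <- data.frame(
  ID    = c(1, 1, 1, 1, 2),
  value = c(1, 1, 1, 1, 3)
)

summary_scores <- scored |>
  group_by(ID) |>
  summarise(
    sentiment    = sum(value),   # total score
    AvgSentiment = mean(value)   # total divided by the number of scored words
  )
summary_scores
```

Review 1 has the larger total (4 versus 3) only because it has more scored words; on average, review 2 is more positive (3 versus 1).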

Question 13

Create a plot showing how the average sentiment score relates to the number of stars given on the Amazon reviews. Based on what you see, does it look like the number of stars is strongly related to how positive or negative the reviews are on average?

References

Lexicons

AFINN: Finn Årup Nielsen. “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, 718 in CEUR Workshop Proceedings, 93-98. 2011 May. http://arxiv.org/abs/1103.2903.

bing: Minqing Hu and Bing Liu. “Mining and summarizing customer reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.

Code

The code was adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.