Introduction
Can sentiment analysis accurately read the positivity or negativity of a product review? We will analyze a large dataset of Amazon reviews using sentiment analysis to find out.
A project earlier in the semester focused on text processing using the conventional document-term matrix approach and the tm package. Utilizing sentiment analysis, we attempted to build a model to detect spam emails using a training corpus of labelled spam and non-spam (ham) emails. I struggled a bit with both processing the text and building the model.
This project will take another shot at sentiment analysis - this time using a tidy approach. We will take a data set of Amazon reviews that includes both review text and a star rating and examine whether sentiment analysis of the review text can predict the star ratings.
This exercise will lean heavily on detailed example blog posts by Julia Silge and David Robinson, the authors of the book Text Mining with R: A Tidy Approach.
Data Source
Dr. Julian McAuley of the University of California, San Diego makes a huge set of Amazon reviews available for download, hosted on Stanford's SNAP site. The reviews include ratings, review text, and review helpfulness votes, along with product metadata like descriptions, prices, and similar products that customers also viewed. The reviews span 1996 to 2014. Dr. McAuley offers "smaller" (still usually 100,000+) data sets of reviews by category for immediate download; bigger sets are available by request. For our purposes, the smaller data sets will suit us fine.
Note that these Amazon data sets have already undergone some "cleaning" to remove duplicates and fraudulent or mistaken reviews. Most notably, our review data only includes products that have five or more reviews and reviews from individuals who have written at least five reviews. This helps screen out fraudulent or outlier reviews - like the guy I know from middle school who appears to be writing the only positive reviews of his own self-published book under assumed names that have each written exactly one review.
We are going to start with reviews for products from the Baby category. We will process and analyze the Baby review data, slicing and dicing the text in a number of ways. Next, we will take a look at reviews from another category to see if sentiment analysis of some Amazon categories is more closely related to the star ratings than others. Note that we won't actually be doing any modelling. We are just looking to establish sentiment analyses of text that display reliable linear relationships to the accompanying star ratings.
Purpose
There is likely no direct business utility in confirming that sentiment analysis can be constructed to be predictive of star ratings. So why are we doing this? For one, the accompanying star ratings provide pretty solid evidence that, if correlation is established, sentiment analysis works. If done correctly, five-star ratings should have higher mean sentiments than four-star ratings, four than three, and so on.
There are likely adjacent applications to business using a similar approach. Many reviews or general blog, forum, or social media posts are not accompanied by star ratings. Employees at a company who are responsible for reputation management, social media, or just general marketing could set up a web crawler to find very positive or negative posts related to their enterprise.
Legal document review or text mining of physician notes might not lend themselves to positive/negative sentiment analysis, but we could definitely foresee other use cases for trying to derive automated sentiment of some sort from these types of text. Sentiment analysis does not need to be used to make automated decisions. It could just be used to prioritize documents, text, or some sort of textual objects for human review. If a big court case has 20,000 documents, there could be huge value in having automated text processing that provides a prioritized plan of attack for review.
Let’s get back to our straightforward look at the Amazon reviews.
Data Setup
This project requires dealing with two different types of data sources. I will cheat a little here. The first data set - the baby reviews - will be manually downloaded in .gz format, extracted into .json form, and then read into R from the local machine due to file size. For the second set of reviews, related to Grocery and Gourmet Food, we will programmatically download and unzip the file in R. In the end, both data sets will be in JSON format before analysis begins. Adding a data set from a different source, or running one of these data sets through a database before bringing them back into R, would be an exercise in box-checking. This semester, we had more difficulty programmatically downloading and extracting files from the web than we did working with databases through R, so we could use the extra repetitions here.
The main data set of interest, 160,792 baby reviews, is 37 MB, which exceeds GitHub's file upload size limit. To follow along, download the file using the link. Next, unzip the compressed .gz file to a .json file using a file utility like 7-Zip. Finally, we read that file into R.
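Alternatively, the unzip step can be scripted with R.utils, the same package we use later for the grocery file. A minimal sketch, assuming the downloaded Baby_5.json.gz sits in the working directory:
library(R.utils)
#decompress Baby_5.json.gz to Baby_5.json; remove = FALSE keeps the archive
gunzip("Baby_5.json.gz", destname = "Baby_5.json", remove = FALSE)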
library(dplyr)
library(readr)
baby_file <- "Baby_5.json"
#read_lines creates character vector for each line
review_lines <- read_lines(baby_file, progress = FALSE)
Following David Robinson's example, let's turn this into a tibble with columns for the reviews and accompanying metadata.
library(stringr)
library(jsonlite)
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")
#flatten and turn into a tibble
df_reviews <- fromJSON(reviews_combined) %>%
flatten() %>%
tbl_df()
Let's check out a sample of the tibble to verify the format is as intended.
library(kableExtra)
#Limiting the tibble to 5 rows here because otherwise kable will process whole thing
kable(df_reviews[1:5,]) %>%
kable_styling("striped", full_width = F) | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|
| A1HK2FQW6KXQB2 | 097293751X | Amanda Johnsen “Amanda E. Johnsen” | c(0, 0) | Perfect for new parents. We were able to keep track of baby’s feeding, sleep and diaper change schedule for the first two and a half months of her life. Made life easier when the doctor would ask questions about habits because we had it all right there! | 5 | Awesine | 1373932800 | 07 16, 2013 |
| A19K65VY14D13R | 097293751X | angela | c(0, 0) | This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at […] | 5 | Should be required for all new parents! | 1372464000 | 06 29, 2013 |
| A2LL1TGG90977E | 097293751X | Carter | c(0, 0) | Helps me know exactly how my babies day has gone with my mother in law watching him while I go to work. It also has a section for her to write notes and let me know anything she may need. I couldn’t be happier with this book. | 5 | Grandmother watching baby | 1395187200 | 03 19, 2014 |
| A5G19RYX8599E | 097293751X | cfpurplerose | c(0, 0) | I bought this a few times for my older son and have bought it again for my newborn. This is super easy to use and helps me keep track of his daily routine. When he started going to the sitter when I went back to work, it helped me know how his day went to better prepare me for how the evening would most likely go. When he was sick, it help me keep track of how many diapers a day he was producing to make sure he was getting dehydrated. The note sections to the side and bottom are useful too because his sitter writes in small notes about whether or not he liked his lunch or if the playtime included going for a walk, etc.Excellent for moms who are wanting to keep track of their kids daily routine even though they are at work. Excellent for dads to keep track as my husband can quickly forget what time he fed our son. LOL | 5 | repeat buyer | 1376697600 | 08 17, 2013 |
| A2496A4EWMLQ7 | 097293751X | C. Jeter | c(0, 0) | I wanted an alternative to printing out daily log sheets for the nanny to fill out, and this has worked out great! I’m no longer searching my daughter’s bag for a crumpled piece of paper each day. It’s also nice to be able to look back on previous days and weeks for eating and sleeping patterns. I would have preferred a plastic-type cover, but it’s held up well so far. | 4 | Great | 1396310400 | 04 1, 2014 |
So this looks good. We have our “reviewText” column for sentiment analysis, and the 1-to-5 scaled rating in the “overall” variable.
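As an aside, because the file is newline-delimited JSON, jsonlite's stream_in can parse it directly and skip the string concatenation. A minimal sketch that should yield an equivalent tibble:
#stream_in parses one JSON object per line and returns a data frame
df_reviews_alt <- stream_in(file("Baby_5.json")) %>%
flatten() %>%
tbl_df()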
Tidy Text
This semester we have learned - and as Julia Silge reiterates in her discussion of the tidytext package - that tidy principles dictate that in a data set, each variable is a column, each observation is a row, and each type of observational unit is a table. As she states, tidy data sets are easier to work with, and this is no less true when one starts to work with text. Let's get to it.
As the Text Mining with R book states, the essence of the tidy text format is a data frame or tibble with one token, usually a single word, per row. The tidytext package includes a function called "unnest_tokens" that converts the text into individual tokens - here, words.
The Yelp review dataset from David Robinson's example includes a ReviewID attribute, a primary key or unique identifier for each review. This comes in handy after breaking the reviews into individual tokens or words - a way to associate the rows with other rows, or words, from the same review.
The Amazon dataset does not have such a unique identifier for reviews, so let’s create one while there is still one row per review.
df_reviews <- df_reviews %>% mutate(reviewID = row_number())
Now, we unnest the text column.
library(tidytext)
df_reviews_words <- df_reviews %>%
#only new subset of columns for our exercise
select(asin,reviewID,reviewText,overall) %>%
#see https://www.rdocumentation.org/packages/tidytext/versions/0.2.0/topics/unnest_tokens
unnest_tokens(word,reviewText) %>%
#following Robinson's example, we are running a NOT IN on stop words and formatting
#could also use an anti_join here (see the sketch after the preview below)
filter(!word %in% stop_words$word,
str_detect(word,"^[a-z']+$"))
Wow, we went from roughly 160,000 rows (reviews) to more than 5 million rows (words). We're only keeping the asin (a product ID), our created unique reviewID, the now-unnested reviewText - which has become "word" - and the "overall" rating. Let's take a closer look.
library(kableExtra)
#Limiting the tibble to 5 rows here because otherwise kable will process whole thing
kable(df_reviews_words[1:5,]) %>%
kable_styling("striped", full_width = F) | asin | reviewID | overall | word |
|---|---|---|---|
| 097293751X | 1 | 5 | perfect |
| 097293751X | 1 | 5 | parents |
| 097293751X | 1 | 5 | track |
| 097293751X | 1 | 5 | baby’s |
| 097293751X | 1 | 5 | feeding |
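As noted in the code comments, the stop-word filter could equivalently be written as an anti_join against the stop_words table. A minimal sketch of that variant:
#equivalent tokenization with anti_join instead of the NOT IN filter
df_reviews_words_alt <- df_reviews %>%
select(asin, reviewID, reviewText, overall) %>%
unnest_tokens(word, reviewText) %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "^[a-z']+$"))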
Sentiment Analysis Using AFINN
That's what we wanted, so we progress to sentiment analysis, borderline plagiarizing from David Robinson's Yelp review post. Sentiment analysis relies on lexicons, which assign sentiment values to words or phrases. For our ratings purposes, we will use the AFINN lexicon, which scores words on a scale of -5 (most negative) to 5 (most positive).
First, we create a tibble from the sentiments table that comes with tidytext that only contains the values for the AFINN lexicon.
AFINN_lex_sent <- sentiments %>%
filter(lexicon == "AFINN") %>%
select(word, afinn_score = score)
Now we join that AFINN-scored tibble to our tidy text table of Amazon reviews.
AFINN_reviews_sentiment <- df_reviews_words %>%
inner_join(AFINN_lex_sent, by = "word") %>%
group_by(reviewID, overall) %>%
summarize(sentiment = mean(afinn_score))
Previewing it.
library(kableExtra)
#Limiting the tibble to 20 rows here because otherwise kable will process whole thing
kable(AFINN_reviews_sentiment[1:20,]) %>%
kable_styling("striped", full_width = F) | reviewID | overall | sentiment |
|---|---|---|
| 1 | 5 | 3.0000000 |
| 2 | 5 | 0.5000000 |
| 3 | 5 | 2.0000000 |
| 4 | 5 | 1.2857143 |
| 5 | 4 | 3.0000000 |
| 6 | 4 | -2.0000000 |
| 8 | 5 | 2.0000000 |
| 9 | 3 | 1.5000000 |
| 10 | 5 | 2.4000000 |
| 11 | 5 | -1.0000000 |
| 12 | 5 | 1.0000000 |
| 14 | 5 | 3.0000000 |
| 16 | 3 | 1.2727273 |
| 17 | 3 | 1.7142857 |
| 18 | 5 | 1.6666667 |
| 19 | 5 | 2.0000000 |
| 20 | 5 | 2.0000000 |
| 21 | 5 | -0.6666667 |
| 22 | 5 | 1.6000000 |
| 23 | 5 | -0.5000000 |
Just in those first 20 rows, we see AFINN values that are all over the map, including some negative scores on reviews accompanied by five-star ratings. This could get interesting.
Again, stealing blatantly from the work of others, let's create a summary table and box plots that show the mean sentiment scores by star rating.
library("data.table")
dt_ars <- data.table(AFINN_reviews_sentiment)
dt_ars <- dt_ars[,list(mean=mean(sentiment),sd=sd(sentiment)),by=overall]
dt_ars[order(overall)]
## overall mean sd
## 1: 1 -0.09909415 1.331249
## 2: 2 0.31950832 1.322449
## 3: 3 0.63546456 1.299319
## 4: 4 1.03324506 1.198206
## 5: 5 1.44055007 1.188802
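For consistency with the rest of our dplyr pipeline, the same summary could be written without data.table. A sketch:
#group by star rating and compute mean and sd of review sentiment
AFINN_reviews_sentiment %>%
group_by(overall) %>%
summarize(mean = mean(sentiment), sd = sd(sentiment))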
library(ggplot2)
ggplot(AFINN_reviews_sentiment, aes(overall, sentiment, group = overall)) + geom_boxplot()
There's a clear relative correlation between the aggregate AFINN sentiment score of a review and its star rating. However, the effect is not nearly as pronounced as what David Robinson saw in his similar exercise with Yelp reviews. Might that be because reviews for baby products are written differently than the restaurant reviews that dominate Yelp? Or maybe Yelp's more social, place-based nature lends itself to more expressive types of writing?
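To put a rough number on that relationship, we could compute a simple correlation or linear fit. A sketch (results not shown here):
#Pearson correlation between star rating and mean AFINN sentiment
cor(AFINN_reviews_sentiment$overall, AFINN_reviews_sentiment$sentiment)
#or, for a linear view, a simple regression
summary(lm(sentiment ~ overall, data = AFINN_reviews_sentiment))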
Outlier Analysis
Just for fun, let's take a look at a couple of outliers. First, the one-star reviews with a sentiment rating above 3.75.
AFINN_reviews_sentiment_outlier_1 <- AFINN_reviews_sentiment %>%
filter(overall == 1 & sentiment > 3.75) %>%
select(reviewID)
df_reviews_outlier_low <- df_reviews %>%
select(reviewID,reviewText,overall) %>%
#the $ extraction already yields a vector, which %in% accepts directly
filter(reviewID %in% AFINN_reviews_sentiment_outlier_1$reviewID)
kable(df_reviews_outlier_low) %>%
kable_styling("striped", full_width = F)
| reviewID | reviewText | overall |
|---|---|---|
| 3367 | It made my breasts look funny, it made my nipples more irritable & it left painful marks on my breasts. Hardly ever used it. :( | 1 |
| 13469 | I bought this pump because I was planning on exclusively breastfeeding and thinking I wouldnt need an electric pump since I would only be doing the occasional pumping. Well there is a reason the product comes with a 30 day warranty… because on the 31st day it isnt going to work anymore. At first it worked amazing, but after using it only a handful of times, it started to lose suction. I tried contacting Medela about it but they sent me an email reminding me of the 30 day warranty. So I finally caved and bought an electric pump… a Lansinoh. | 1 |
| 14386 | Great. So my wife and I purchased this ultra-convenient frame/stroller for convenience. It worked great for about 8 weeks, then one of the front wheels bent to where we couldn’t use it anymore. AND, the front wheels would at times jam the stroller from opening up in the first place. In concept, a frame/stroller like this is awesome, but Graco needs to redesign this and make it better and sturdier. All in all, a major dissapointment. | 1 |
| 27072 | I am a new mother and my daughter is now almost 5 months old. We have had this toy since she was born. I thought it would be a fun toy for her but so far she has not been very interested. Occasionally she will pick the wrist rattles up and look at them. When I put the socks on her they do not help her find her feet. The rattle does not make enough noise to hold her interest. If you are a first time parent considering this toy I would skip it. Your baby will find their feet on their own. | 1 |
| 33447 | Replaced by a bizzilion phone apps. Would have been awesome in the 90s. | 1 |
| 41846 | We have a Evenflo Triumph LX Platinum Convertible Car Seat (manufacture date: March 2014) and it barely fits the carseat. The car seat is too wide for the travel bag. My husband had to hold the zipper teeth together as tightly as he could in order for me to zip it up and it strained the zipper (I thought we’d have to duct tape the bag closed). You can’t put anything else in the bag either. Will have to purchase a new car seat travel bag. | 1 |
| 59078 | I had flat nipples and thought that this would help. It did nothing!! I’m a first time mom and was concern that I wouldn’t be able to breastfed. My son is 8 months old now and I can tell you, a breast pump will help you if you use it often enough. For myself, at least 5 times a day for at least 30 minutes. It changed everything! Wow. | 1 |
| 71063 | This thing was awesome the first 2 times. I let the food overflow the last time and when I folded it to take the cubes out, the whole thing tore apart. I use Wilton brownie bite silicon trays instead for homemade baby food. | 1 |
| 71973 | All you need its the original Nas Frida, this is just extra money not needed to spend. I wish I didn’t purchase. I do however think the NASFRIDA product is a lifesaver and will order for any of my expectant friends. | 1 |
| 92956 | We just returned this tub from Amazon. My son is 3 months old, and needed a tub that reclines and is long enough for him to stretch his legs. It hardly reclines, and was very short. His legs went passed the inside of the tub. Inside the liner are plastic sharp pieces of the tub, and my baby scratched his head on one of them. I think it is very small, and could not imagine a toddler having any fun in this tub. I am searching for another one, in the meantime we are just using his infant tub from wal-mart. | 1 |
| 103423 | I saw this chair and thought, “Wow Fisher-Price just made the world’s greatest high chair” Unfortunately I didn’t realize that the crotch post thing is so close to the seat back that I can barely cram my 7-month-old, 30th percentile for weight, skinny little baby behind it. It has everything a parent could wish for except enough room to put your baby in it :( | 1 |
| 131548 | I clearly got carried away with the hype for all the new things out there to transition a baby to the cup. Wow when my mom said just use a cup. I will I’m done with the wired illogical designs. Time for a cup, use a cup. | 1 |
| 143155 | The theory behind this is brilliant but in practice it just doesn’t do the job. It’s not like with the angle, you can sit as you normally do. You still need to angle yourself a bit. The price is also way too high for something like this. | 1 |
We see a one-star review whose sentiment score is clearly mistaken (reviewID 3367). We have multiple interesting cases where a reviewer dismisses the product they bought but then goes on to offer extended praise for an alternative product. Also, there are some clearly negative reviews that just don't use many negative words, like reviewID 41846.
Next, let’s examine the five-star review that came in with a sentiment rating below -3.75.
AFINN_reviews_sentiment_outlier_5 <- AFINN_reviews_sentiment %>%
filter(overall == 5 & sentiment < -3.75) %>%
select(reviewID)
df_reviews_outlier_high <- df_reviews %>%
select(reviewID,reviewText,overall) %>%
#as above, the extracted column is already a vector that %in% can use
filter(reviewID %in% AFINN_reviews_sentiment_outlier_5$reviewID)
kable(df_reviews_outlier_high) %>%
kable_styling("striped", full_width = F)
| reviewID | reviewText | overall |
|---|---|---|
| 2668 | We call this tub “Moby Dick” in our house on account of it’s size, but this has been a great tub for my now 16 mo old. He really cannot slip in this tub and has ample room to play and splash. I use the baby side to shampoo his hair.All in all, a great tub! | 5 |
Just one outlier here. Going back to our original unnested tidy text tibble entries for that review, let’s take a look at how individual words in it scored on the AFINN lexicon:
df_reviews_words %>%
inner_join(AFINN_lex_sent, by = "word") %>%
filter(reviewID == 2668)
## # A tibble: 1 x 5
## asin reviewID overall word afinn_score
## <chr> <int> <dbl> <chr> <int>
## 1 B000056OV0 2668 5 dick -4
OK… First, note that the join matched only a single word in this review against the AFINN lexicon; the listing further below shows that only 2,476 words carry AFINN sentiment scores, so that seems plausible. Second, we could prevent cases like "Moby Dick" by removing capitalized words that don't occur at the beginning of sentences.
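A rough sketch of that idea - approximate, since it also discards sentence-initial words along with proper nouns:
#keep original case while tokenizing, drop tokens starting with a capital
#letter (a crude proper-noun screen that also drops sentence-initial words),
#then lower-case what remains and proceed as before
df_reviews_words_nocaps <- df_reviews %>%
select(asin, reviewID, reviewText, overall) %>%
unnest_tokens(word, reviewText, to_lower = FALSE) %>%
filter(!str_detect(word, "^[A-Z]")) %>%
mutate(word = str_to_lower(word)) %>%
filter(!word %in% stop_words$word,
str_detect(word, "^[a-z']+$"))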
sentiments %>%
filter(lexicon=="AFINN")## # A tibble: 2,476 x 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abandon <NA> AFINN -2
## 2 abandoned <NA> AFINN -2
## 3 abandons <NA> AFINN -2
## 4 abducted <NA> AFINN -2
## 5 abduction <NA> AFINN -2
## 6 abductions <NA> AFINN -2
## 7 abhor <NA> AFINN -3
## 8 abhorred <NA> AFINN -3
## 9 abhorrent <NA> AFINN -3
## 10 abhors <NA> AFINN -3
## # ... with 2,466 more rows
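One reproducibility caveat: in tidytext releases after this analysis was written, the AFINN scores were moved out of the bundled sentiments table for licensing reasons. If filter(lexicon == "AFINN") comes back empty for you, the equivalent (a sketch; it may prompt a one-time download via the textdata package) is:
#in newer tidytext versions, the score column is named "value"
AFINN_lex_sent <- get_sentiments("afinn") %>%
select(word, afinn_score = value)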
Baby reviews may be unique. We see many examples of a negative review for a baby product turning into a recommendation for another product or solution. A Yelp review could instead recommend another restaurant, but this appears to be much less common. Let's look at reviews from another Amazon product category to see if sentiment analysis using AFINN is a better predictor of its star ratings.
AFINN Sentiment Analysis of Another Product Category: Grocery Reviews
Let's download the dataset of reviews from Amazon's Grocery and Gourmet Food category. As mentioned earlier, we're going to automate grabbing and unzipping the data from the URL. Again, we had much more difficulty with these types of steps this semester than with working with different data types and tools, so the practice is welcome.
library(R.utils)
#download and unzip file
fileloc <- "C:\\Users\\littl\\Documents\\CUNY Data Science\\DATA 607\\Final Project\\reviews_Grocery.json.gz"
download.file("http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Grocery_and_Gourmet_Food_5.json.gz",fileloc)
gunzip(fileloc)
#read into R
grocery_file <- "reviews_Grocery.json"
#read_lines creates character vector for each line
review_grocery_lines <- read_lines(grocery_file, progress = FALSE)
Now we're going to speed through the data preparation steps, with the goal of creating another summary table and boxplot analyzing sentiment scores from the Amazon grocery reviews.
grocery_reviews_combined <- str_c("[", str_c(review_grocery_lines, collapse = ", "), "]")
#flatten and turn into a tibble
df_grocery_reviews <- fromJSON(grocery_reviews_combined) %>%
flatten() %>%
tbl_df()
df_grocery_reviews <- df_grocery_reviews %>% mutate(reviewID = row_number())
df_grocery_reviews_words <- df_grocery_reviews %>%
#only new subset of columns for our exercise
select(asin,reviewID,reviewText,overall) %>%
#see https://www.rdocumentation.org/packages/tidytext/versions/0.2.0/topics/unnest_tokens
unnest_tokens(word,reviewText) %>%
#following Robinson's example, we are running a NOT IN on stop words and formatting
#could also use an anti_join here
filter(!word %in% stop_words$word,
str_detect(word,"^[a-z']+$"))
AFINN_grocery_reviews_sentiment <- df_grocery_reviews_words %>%
inner_join(AFINN_lex_sent, by = "word") %>%
group_by(reviewID, overall) %>%
summarize(sentiment = mean(afinn_score))
And now let's see the results.
dt_agrs <- data.table(AFINN_grocery_reviews_sentiment)
dt_agrs <- dt_agrs[,list(mean=mean(sentiment),sd=sd(sentiment)),by=overall]
dt_agrs[order(overall)]
## overall mean sd
## 1: 1 -0.1742537 1.477033
## 2: 2 0.4062866 1.393347
## 3: 3 0.7101844 1.342217
## 4: 4 1.1448954 1.228968
## 5: 5 1.5228517 1.246981
ggplot(AFINN_grocery_reviews_sentiment, aes(overall, sentiment, group = overall)) + geom_boxplot()
Sentiment Analysis: Other Lexicons
As previously mentioned, the AFINN lexicon includes sentiment ratings on a scale of -5 to 5. As the Text Mining with R book details in Chapter 2, we can use other lexicons as well. The nrc lexicon, for example, labels words as expressing certain sentiments rather than scoring them. Those labels include "positive" and "negative," so you can count the numbers of positive and negative words as an attempt at measuring sentiment. You can also look at other sentiments such as "joy" to get a feel for the positivity of text, as sketched below. The bing lexicon offers similar functionality, with a narrower focus on just positive and negative sentiment.
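Here is a minimal sketch of that nrc "joy" idea, following Chapter 2 of the book (get_sentiments("nrc") may prompt a one-time download via the textdata package):
#count the most common "joy" words in the baby reviews per the nrc lexicon
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
df_reviews_words %>%
inner_join(nrc_joy, by = "word") %>%
count(word, sort = TRUE)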
Just for fun, let's look at the reviews and ratings using a different lexicon. Borrowing heavily from Chapter 2 of the book, let's try to get a feel for sentiment using the bing lexicon, returning to the baby product reviews. Ignoring the ratings: which words (with a bing value) appear most frequently in the Baby product reviews, and what is their bing sentiment?
library(tidyr)
bing_sentiment <- df_reviews_words %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment,sort = TRUE) %>%
ungroup()
bing_sentiment
## # A tibble: 3,888 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 love positive 41384
## 2 easy positive 41259
## 3 nice positive 20990
## 4 loves positive 19343
## 5 recommend positive 17832
## 6 soft positive 17133
## 7 clean positive 16695
## 8 perfect positive 15727
## 9 hard negative 12769
## 10 pretty positive 12609
## # ... with 3,878 more rows
Let's try to duplicate our earlier AFINN analysis to see how a different derivation of sentiment - subtracting the count of negative words from the count of positive words per review - looks using bing.
BING_lex_sent <- sentiments %>%
filter(lexicon == "bing") %>%
select(word, sentiment)
#adding score
BING_lex_sent <- BING_lex_sent %>%
mutate(bing_score = ifelse(sentiment == "positive",1,-1))
BING_reviews_sentiment <- df_reviews_words %>%
inner_join(BING_lex_sent, by = "word") %>%
group_by(reviewID, overall) %>%
summarize(sentiment_score = sum(bing_score))
dt_brs <- data.table(BING_reviews_sentiment)
dt_brs <- dt_brs[,list(mean=mean(sentiment_score),sd=sd(sentiment_score)),by=overall]
dt_brs[order(overall)]
## overall mean sd
## 1: 1 -1.3534413 3.131554
## 2: 2 -0.3586794 3.043360
## 3: 3 0.4932449 2.979904
## 4: 4 1.5137988 3.231311
## 5: 5 2.2415361 2.906329
This looks promising, though note the large standard deviations - much larger than with AFINN, though we're using a significantly different methodology here. Let's take a look at the box plot.
ggplot(BING_reviews_sentiment, aes(overall, sentiment_score, group = overall)) + geom_boxplot() +
#mean
stat_summary(fun.y = mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..),
width = .75, linetype = "dashed")We’re adding a dashed line to show the mean using ggplot’s error bar functionality, as our median will always be an integer. At first glance, we see less difference in sentiment between the rankings. But look at the score the range.
As a reminder, the AFINN sentiment score for a review was derived by taking all the words from a review that had entries in the AFINN lexicon and then taking the mean of their sentiment scores. With bing, we're assigning 1 to words with a "positive" sentiment value and -1 to those with a "negative" sentiment. A review's bing sentiment is calculated by simply summing those scores for every word in the review that appears in the lexicon. Because of this, we see a much larger range of sentiments, with a handful of reviews approaching -50 and 50.
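If that range were a concern, one alternative would be to average rather than sum the bing scores, putting each review on a per-word footing comparable to the AFINN means. A sketch:
#mean instead of sum bounds each review's bing score to [-1, 1]
BING_reviews_sentiment_mean <- df_reviews_words %>%
inner_join(BING_lex_sent, by = "word") %>%
group_by(reviewID, overall) %>%
summarize(sentiment_score = mean(bing_score))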
Let’s create another boxplot but limit the y-scale to the -5 to 5 scale.
ggplot(BING_reviews_sentiment, aes(overall, sentiment_score, group = overall)) + geom_boxplot() + ylim(-5,5) +
stat_summary(fun.y = mean, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..),
width = .75, linetype = "dashed")## Warning: Removed 14665 rows containing non-finite values (stat_boxplot).
## Warning: Removed 14665 rows containing non-finite values (stat_summary).
Our scoring methodology using the bing lexicon actually shows a larger separation in means by rating, though the spread of sentiment scores within each rating is significantly bigger as well. The scores provided by AFINN still seem like the best fit for this task, but depending on the ultimate aim, other lexicons may suit better.
Conclusion
Using the star ratings as a check, we established that sentiment analysis can distinguish between positive and negative reviews. We also saw that - at least in my opinion - the tidy text approach makes working with text easier and more straightforward.
In performing sentiment analysis, there are many options in terms of approach. After preparing a data set for review, decisions have to be made about whether or not to include stop words, capitalized words, or other phrases. Next, one must consider the available lexicons. After getting an initial set of results and performing some analysis, it's probably worth revisiting these choices. Additionally, delving into the text to see if there are recurring words or phrases throwing off the results would often be a good step. In social media text, for example, there might be recurring slang or phrases from memes negatively affecting the results in material ways that could be handled. After additional analysis, it might be worth exploring weighting the scores in different manners - for example, based on where words occur in the review.
For this exercise, we stuck with individual words and didn't even look at phrases. Methods like n-grams and pairwise calculations also play nicely with tidy text data (a tiny sketch below) and could likely be used to improve the predictions. Applying machine learning methods to the text would likely lead to better outcomes in many cases.
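As a flavor of how little code the n-gram extension requires with tidytext, a sketch tokenizing the reviews into bigrams (token and n are standard unnest_tokens arguments):
#tokenize reviews into two-word phrases rather than single words
df_reviews_bigrams <- df_reviews %>%
select(reviewID, reviewText) %>%
unnest_tokens(bigram, reviewText, token = "ngrams", n = 2)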
Sources
Text Mining with R: A Tidy Approach, Julia Silge and David Robinson (online edition, first published 2018-09-23)
David Robinson’s blog: Does sentiment analysis work? A tidy analysis of Yelp reviews
Julia Silge’s blog (numerous entries)
Amazon Review Data, made available by Dr. Julian McAuley. Per his direction, the data set should be cited via the following paper (meaning when publishing in an academic manner): R. He and J. McAuley, "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering," WWW, 2016.