In this lab, we will use tidy text techniques to analyze a dataset of Amazon reviews. Each problem uses the tidy text mining techniques described in chapter 2 (Problem 1), chapter 3 (Problem 2), or chapter 4 (Problem 3) of the Tidy Text Mining with R textbook. Note: the dataset for this assignment is a bit bigger than what we have typically worked with in class. On my computer everything ran fast enough, but if your computer is older and you find the computations intolerably slow, you may reduce the size of the dataset by 90% by taking only the first 10% of reviews. If you do this, make sure it is clearly stated. I have also listed a second, shorter version of the file.
Problem 1: Sentiment and Review Score
I will be using the shorter version of the dataset where 90% of the reviews have been dropped.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To make sentiment analysis possible, add an index variable to the review data frame so that each review is uniquely identified by an integer. Then tokenize the review data frame using words as the tokens, and remove all stop words from the data set.
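Here is a sketch of that setup step, assuming the raw reviews are in a data frame called shortdata with the text in a reviewText column (the names the later chunks rely on):

library(tidytext)

# Give each review a unique integer ID
shortdata <- shortdata |>
  mutate(reviewID = row_number())

# Tokenize into single words and remove stop words
shortdata_tokens <- shortdata |>
  unnest_tokens(word, reviewText) |>
  anti_join(stop_words, by = "word")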
Does sentiment correlate with reviews? Use the afinn lexicon to calculate a sentiment score for each review, normalizing by the number of lexicon words in each review. Visualize the distribution of sentiment scores for each rating and calculate the mean sentiment score for each review category. What do you observe?
afinn <- get_sentiments("afinn")

shortdata_sentiment <- shortdata_tokens |>
  inner_join(afinn, by = "word")
Next we will calculate the sentiment score for each review by summing the afinn scores. We’ll divide the summed score by the count of lexicon words in each review in order to normalize the sentiment scores (this avoids skewing the scores by review length).
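Here’s a sketch of that calculation. Since only afinn-matched words survive the inner join, n() counts the lexicon words in each review; overall is the star rating carried along from the original data:

shortdata_sentiment_reviewscores <- shortdata_sentiment |>
  group_by(reviewID, overall) |>
  summarize(sentimentscore = sum(value) / n(), .groups = "drop")
shortdata_sentiment_reviewscores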
ggplot(shortdata_sentiment_reviewscores, aes(x = factor(overall), y = sentimentscore)) +
  geom_boxplot() +
  labs(title = "Distribution of Sentiment Scores for each Rating",
       x = "Review Rating", y = "Sentiment Score")
It’s very interesting that the median sentiment score for a review rating of 2 is ‘0’ – on a scale of 1-5, I might have expected 3 to be the rating with a median sentiment score of 0. It’s also interesting that review ratings of 1, 4, and 5 show outliers while 2 and 3 do not, and especially that the interquartile range is broadest for the review score of 3. After thinking about it, I wonder how much ‘storytelling’ language goes into these types of reviews, in particular for the ‘extreme’ review ratings of 1 and 5. For example, I’ve seen a fair share of online reviews that go something like, “frying eggs used to be a nightmare - my old nonstick skillet was the worst. But this new one is amazing and solves all my problems! Five stars!” – the review recalls the ‘negative’ language of ‘before I got this item’, and that sentiment is still part of the review. I have also seen reviews that go the other way, like “I purchased a pair of these mittens every year for the past 5 years and loved them, but now it seems it’s a new manufacturer and they’re terrible - itchy and thin - 1 star!” Next we’ll calculate and visualize the mean sentiment score for each rating.
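A sketch of the summary behind the next plot, reusing the per-review scores from above:

meanSentiment_byrating <- shortdata_sentiment_reviewscores |>
  group_by(overall) |>
  summarize(meanSentiment = mean(sentimentscore))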
meanSentiment_byrating |>
  ggplot(aes(x = factor(overall), y = meanSentiment)) +
  geom_col(fill = "burlywood", color = "saddlebrown") +
  ## I tried to make the columns evoke Amazon packages :D
  # geom_point(size = 8, color = "turquoise") +
  labs(title = "Mean sentiment score by review rating",
       x = "Review Rating", y = "Mean sentiment score")
When we look simply at the mean sentiment score by review rating, I see a trend that ‘makes sense’, with much less of the nuance than in the boxplot above. As expected, the mean sentiment score of a 1-star review is the most negative, a rating of 2 carries more ‘neutral’ language, and the scores grow steadily up to 5, which has the highest mean sentiment score.
Reviewer Personalities: For each reviewer, compute the number of ratings, the mean sentiment, and the mean review score. Filter for reviewers who have written more than 10 reviews, and plot the relationship between mean rating and mean sentiment. What do you observe?
First I need to get the reviewerIDs back from the prior step – the sentiment review scores data frame no longer has the reviewer IDs.
sentiment_reviewscores_reviewers <- shortdata_sentiment_reviewscores |>
  left_join(
    # distinct() keeps one reviewerID row per review, so the join
    # doesn't duplicate each review once per token
    shortdata_tokens |> distinct(reviewID, reviewerID),
    by = "reviewID"
  )
sentiment_reviewscores_reviewers
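A sketch of the per-reviewer summary, keeping only reviewers with more than 10 reviews as the prompt asks:

reviewers <- sentiment_reviewscores_reviewers |>
  group_by(reviewerID) |>
  summarize(
    nRatings = n(),
    meanSentiment = mean(sentimentscore),
    meanRating = mean(overall)
  ) |>
  filter(nRatings > 10)
reviewers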
reviewers |>
  ggplot(aes(x = meanRating, y = meanSentiment)) +
  geom_point(color = "saddlebrown", alpha = 0.5) +
  labs(title = "Relationship between Reviewers' mean rating and mean sentiment")
This is a bit strange but also makes sense given some of my observations in the previous part of the exercise - sometimes positive sentiments appear in negative reviews, and vice versa, so there’s less ‘extreme’ on both ends of the spectrum and more ‘in between’. My initial observation is that there is a broader range of mean sentiment scores for the reviewers with a mean rating of 5. There seems to be a vaguely (perhaps) linear relationship between meanSentiment and meanRating, though I can’t say for certain whether another way of computing the mean would be more valid - perhaps I needed to normalize as in previous steps. It is interesting that most sentiment scores hover around the neutral-to-positive range, visible in the cluster between a meanRating of 4 and 5 with a meanSentiment above 0 but below 2.5.
Problem 2: Words with high relative frequency
As your starting point, take the tokenized data frame that has been filtered to remove stop words, but hasn’t been joined with the sentiment lexicon data. For each item (asin), use the bind_tf_idf function to find the word that occurs in the reviews of that item with the highest frequency relative to the frequency of words in the entire review text dataset.
First, I will use the bind_tf_idf() function to compute the tf-idf (term frequency - inverse document frequency) scores for each word in each asin item.
shortdata_tfidf <- shortdata_tokens |>
  count(asin, word, sort = TRUE) |>
  bind_tf_idf(word, asin, n)
shortdata_tfidf
Next we’ll find the word that occurs in the reviews of each item with the highest frequency relative to the frequency of words in the entire review text dataset.
topwords_asin <- shortdata_tfidf |>
  group_by(asin) |>
  slice_max(tf_idf, n = 1) |>
  ungroup()
topwords_asin
# A tibble: 1,775 × 6
asin word n tf idf tf_idf
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 0449819906 book 21 0.0464 2.29 0.106
2 048625531X mazes 2 0.04 7.45 0.298
3 0486473082 book 7 0.156 2.29 0.356
4 0715329278 book 13 0.0599 2.29 0.137
5 0804844844 creases 2 0.0606 5.14 0.312
6 0823013626 donna 10 0.0252 6.75 0.170
7 0848724666 book 26 0.0942 2.29 0.216
8 0848734270 recipes 7 0.111 5.37 0.596
9 0887248845 jesus 3 0.0316 7.45 0.235
10 0982094825 book 4 0.133 2.29 0.305
# ℹ 1,765 more rows
Select five items from the dataset (either at random or by hand) and look up the asin code for those items on Amazon.com. In each of these cases, does the highest relative frequency word correspond to the identity or type of the item that you chose? You may not be able to find every single item, but I was able to find a solid majority of the ones I searched for by searching amazon.com for the asin.
Yes, these all seem to correspond to the identity or type of the item:

- asin 0804844844, word ‘creases’: this is Origami Paper - Cherry Blossom Patterns. It totally makes sense for ‘creases’ to be the word most unique to this item’s reviews relative to the entire dataset.
- asin 0823013626, word ‘donna’: this book about polymer clay is written by a person named Donna Kato. So this also makes a ton of sense.
- asin 048625531X, word ‘mazes’: this is a book called ‘Easy Mazes Activity Book’, so once again, yes, totally makes sense.
- asin 0887248845, word ‘jesus’: this is a set of ‘dazzle stickers’ in the shape of religious crosses, so a connection like this absolutely makes sense.
- asin 1601409788, word ‘handiwork’: this is a book called ‘Leisure Arts Crochet: pocket guide’ – like those I selected previously, yes, this totally makes sense as corresponding to the item.
Problem 3: Bigrams and Sentiment
Consider the two negative words not and don’t. Starting from the original dataset, tokenize the data into bigrams. Then calculate the frequency of bigrams that start with either not or don’t. What are the 10 most common words occurring after not and after don’t? What are their sentiment values according to the afinn lexicon?
bigrams <- shortdata |>
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2)
OK, now that we’ve created bigrams, we’ll filter them to keep only those that start with “not” or “don’t”. To do this we will separate each bigram into its first and second word (so we see what comes after “not” and “don’t”) and then filter.
bigrams_first2words <- bigrams |>
  separate(bigram, into = c("word1", "word2"), sep = " ")
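And here is a sketch of the filter-and-count step that produces the table below (with_ties = FALSE keeps exactly 10 rows for each of "not" and "don't"):

notdont_bigrams_mostcommon <- bigrams_first2words |>
  filter(word1 %in% c("not", "don't")) |>
  count(word1, word2, sort = TRUE) |>
  group_by(word1) |>
  slice_max(n, n = 10, with_ties = FALSE) |>
  ungroup() |>
  arrange(word1, desc(n))
notdont_bigrams_mostcommon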
# A tibble: 20 × 3
word1 word2 n
<chr> <chr> <int>
1 don't buy 45
2 don't waste 45
3 don't mind 38
4 don't care 35
5 don't expect 34
6 don't feel 27
7 don't leave 22
8 don't lose 22
9 don't hold 21
10 don't understand 21
11 not recommend 103
12 not buy 93
13 not worth 88
14 not fit 65
15 not cut 57
16 not happy 41
17 not hold 41
18 not bad 37
19 not easy 37
20 not stay 37
Now, to find the sentiment of each word2 according to the afinn lexicon, we left-join the afinn values onto the individual words.
notdont_bigrams_mostcommon_sentiments <- notdont_bigrams_mostcommon |>
  left_join(afinn, by = c("word2" = "word")) |>
  select(word1, word2, n, value)
notdont_bigrams_mostcommon_sentiments
# A tibble: 20 × 4
word1 word2 n value
<chr> <chr> <int> <dbl>
1 don't buy 45 NA
2 don't waste 45 -1
3 don't mind 38 NA
4 don't care 35 2
5 don't expect 34 NA
6 don't feel 27 NA
7 don't leave 22 -1
8 don't lose 22 NA
9 don't hold 21 NA
10 don't understand 21 NA
11 not recommend 103 2
12 not buy 93 NA
13 not worth 88 2
14 not fit 65 1
15 not cut 57 -1
16 not happy 41 3
17 not hold 41 NA
18 not bad 37 -3
19 not easy 37 1
20 not stay 37 NA
Strangely, it looks like there is no value associated with a number of the most common words after not and don’t. I am not sure why something like “expect” or “understand” would be missing from afinn. In any case, among the words that do have a sentiment score in afinn, for “don’t” we see two “-1” values and one “2”, so nothing exceedingly negative but definitely leaning negative. More of the words after “not” have values (though still not all of the top 10), with the most negative being “bad” (score of -3) and the most positive being “happy” (score of 3) – which, interestingly, form the phrases “not bad” and “not happy”, very different sentiments on the whole. It’s interesting to observe how breaking things into individual words can change the ‘sentiment score’: the most common word after “not” is “recommend”, and while “recommend” has a sentiment score of 2, the phrase “not recommend” (as in “I would not recommend this item”) is not a positive statement. Compare that to “not bad”, which isn’t as negative (in my opinion) as “not recommend” or “not happy” or “not worth” – another case where word2 has a relatively positive value but the phrase (as in “not worth the money”) does not. Super interesting to see how this all plays into sentiment analysis. Words are not islands!
Pick the most commonly occurring bigram where the first word is not or don’t and the afinn sentiment of the second word is 2 or greater. Compute the mean rating of the reviews containing this bigram. How do they compare to the average review score over the entire dataset?
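From the table above, the qualifying bigrams (value of 2 or greater) are “not recommend” (n = 103) and “don’t care” (n = 35), with “not recommend” the most common. Here’s a sketch of the comparison for both. It assumes bigrams_first2words carried the reviewID and overall columns along from shortdata, and distinct() keeps one row per review so a bigram repeated within a review doesn’t double-count:

# Mean rating of reviews containing each bigram
bigram_ratings <- bigrams_first2words |>
  filter(paste(word1, word2) %in% c("not recommend", "don't care")) |>
  distinct(reviewID, word1, word2, overall) |>
  group_by(word1, word2) |>
  summarize(meanRating = mean(overall), .groups = "drop")
bigram_ratings

# Overall mean rating across the entire dataset, for comparison
mean(shortdata$overall)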
Interestingly, we see that the overall mean rating of the dataset is 4.57, which is much higher than the mean rating for the reviews that contain “don’t care” or “not recommend”. I am not surprised that “not recommend” has a lower mean rating than “don’t care”, because “not recommend” is (I think) more consistently a phrase with negative sentiment, apart from something like “I can’t recommend this enough!”. But something like “don’t care” is more ambiguous - “don’t care what other reviewers say, this is AMAZING!”. Still, it completely fits with our understanding that reviews containing “don’t care” or “not recommend” are associated with lower ratings than the mean across the entire dataset. Very interesting!