Lab 8: Text Mining and NLP

Author

Amanda Rose Knudsen

Overview

In this lab, we will use tidy text techniques to analyze a dataset of amazon reviews. Each problem utilizes the tidy text mining techniques described in either chapter 2 (Problem 1), chapter 3 (Problem 2), or chapter 4 (Problem 3) of the tidy text mining with r textbook. Note: the dataset for this assignment is a bit bigger than what we have typically worked with in the class. On my computer everything worked fast enough, but if your computer is older and you find the computations intolerably slow you may reduce the size of the dataset by 90% by taking only the first 10% of reviews. If you do this make sure it is clearly stated. I have also listed a second shorter version of the file.

Problem 1: Sentiment and Review Score

I will be using the shorter version of the dataset where 90% of the reviews have been dropped.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)
library(stringr)
library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

library(textdata)

shortjson <- fromJSON("short_reviews.json")

shortdata <- shortjson |> 
  unnest(cols = c(overall, verified, reviewerID, asin, reviewText))

To make sentiment analysis possible, add an index variable to the review data frame so that each review is uniquely identified by an integer. Then tokenize the review data frame using words as the tokens, and remove all stop words from the data set.

shortdata <- shortdata |> 
  mutate(reviewID = row_number()) |> 
  relocate(reviewID, .before = 1)
shortdata

# A tibble: 50,000 × 6
   reviewID overall verified reviewerID     asin       reviewText               
      <int>   <int> <lgl>    <chr>          <chr>      <chr>                    
 1        1       4 TRUE     AIE8N9U317ZBM  0449819906 "Contains some interesti…
 2        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 "I'm a fairly experience…
 3        3       4 TRUE     A278N8QX9TY2OS 0449819906 "Great book but the inde…
 4        4       5 TRUE     A123W8HIK76XCN 0449819906 "I purchased the Kindle …
 5        5       5 TRUE     A2A6MZ2QB4AE0L 0449819906 "Very well laid out and …
 6        6       5 TRUE     A2OIU84NPVKIWN 0449819906 "Beginning her career as…
 7        7       5 TRUE     AKIV5VMRUZK1K  0449819906 "This is a terrific stit…
 8        8       4 TRUE     A2BQ7YGPNCQSO4 0449819906 "The book needs to be co…
 9        9       5 TRUE     A2JNO9OR8FGNR4 0449819906 "I really am enjoying th…
10       10       5 TRUE     A1IFF9F3XA9X1I 0449819906 "Just received this book…
# ℹ 49,990 more rows

shortdata_tokens <- shortdata |> 
  unnest_tokens(word, reviewText) |> 
  anti_join(stop_words, by = "word")

shortdata_tokens

# A tibble: 574,423 × 6
   reviewID overall verified reviewerID     asin       word       
      <int>   <int> <lgl>    <chr>          <chr>      <chr>      
 1        1       4 TRUE     AIE8N9U317ZBM  0449819906 stitches   
 2        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 fairly     
 3        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 experienced
 4        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 knitter    
 5        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 color      
 6        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 color      
 7        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 block      
 8        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 intarsia   
 9        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 vein       
10        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 fair       
# ℹ 574,413 more rows

Does sentiment correlate with reviews? Use the afinn lexicon to calculate a sentiment score for each review, normalizing by the number of lexicon words in each review. Visualize the distribution of sentiment scores for each rating and calculate the mean sentiment score for each review category. What do you observe?

afinn <- get_sentiments("afinn")

shortdata_sentiment <- shortdata_tokens |> 
  inner_join(afinn, by = "word")

shortdata_sentiment

# A tibble: 81,662 × 7
   reviewID overall verified reviewerID     asin       word      value
      <int>   <int> <lgl>    <chr>          <chr>      <chr>     <dbl>
 1        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 block        -1
 2        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 fair          2
 3        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 loved         3
 4        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 love          3
 5        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 strength      2
 6        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 fresh         1
 7        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 tears        -2
 8        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 frustrate    -2
 9        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 nifty         2
10        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 love          3
# ℹ 81,652 more rows

Next we will calculate the sentiment scores for each review by summing the afinn scores. We’ll divide the summed score by the count of words in each review in order to normalize the sentiment scores (this will avoid skewing the scores by review length).

shortdata_sentiment_reviewscores <- shortdata_sentiment |> 
  group_by(reviewID, overall) |> 
  summarise(sentimentscore = sum(value) / n(), .groups = "drop")

shortdata_sentiment_reviewscores

# A tibble: 33,579 × 3
   reviewID overall sentimentscore
      <int>   <int>          <dbl>
 1        2       5           1.09
 2        3       4          -3   
 3        4       5           4   
 4        5       5           2   
 5        6       5           1.08
 6        7       5           2.67
 7        8       4           3   
 8        9       5           1.8 
 9       10       5           2.25
10       11       5           3   
# ℹ 33,569 more rows

ggplot(shortdata_sentiment_reviewscores, 
       aes(x = factor(overall), y = sentimentscore)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sentiment Scores for each Rating",
    x = "Review Rating", y = "Sentiment Score")

Next we’ll calculate and visualize the mean sentiment score for each rating. It’s very interesting that the sentiment score of ‘0’ is the median for a review rating of 2 – I might’ve thought that on a scale of 1-5, 3 would be more likely to have a median sentiment score of 0. It’s very interesting that a review rating of 1, 4, and 5 show outliers while 2 and 3 do not. It’s especially interesting that the distribution in the interquartile range is broadest for the review score of 3. After thinking about it, I wonder how much ‘storytelling’ language goes into these types of reviews, in particular for ‘extreme’ review ratings of 1 and 5. For example, I’ve seen a fair share of online reviews that go something like, “frying eggs used to be a nightmare - my old nonstick skillet was the worst. But this new one is amazing and solves all my problems! Five stars!” so it recalls the ‘negative’ language of ‘before I got this item’ and that sentiment is still part of the review. I also have seen reviews that go the other way, like “I purchased a pair of these mittens every year for the past 5 years and loved them, but now it seems it’s a new manufacturer and they’re terrible - itchy and thin - 1 star!”

meanSentiment_byrating <- shortdata_sentiment_reviewscores |> 
  group_by(overall) |> 
  summarise(meanSentiment = mean(sentimentscore), .groups = "drop")

meanSentiment_byrating |> 
  ggplot(aes(x = factor(overall), y = meanSentiment)) +
  geom_col(fill = "burlywood", color = "saddlebrown") +
  ## I tried to make the columns evoke Amazon packages :D
  # geom_point(size = 8, color = "turquoise") +
  labs(
    title = "Mean sentiment score by review rating",
    x = "Review Rating", y = "Mean sentiment score"
  )

When we look simply at the mean sentiment score by review rating I see a trend that ‘makes sense’ with much less of the nuance as in the boxplot above. As expected, the mean sentiment score of a 1-star review is the most negative, and 2 has much ‘neutral’ language, followed by a growth up to 5 which has the highest mean sentiment score.

Reviewer Personalities: For each reviewer, compute the number of ratings, the mean sentiment, and the mean review score. Filter for reviewers who have written more than 10 reviews, and plot the relationship between mean rating and mean sentiment. What do you observe?

First I need to get the reviewerIDs back from the prior step – the sentiment review scores dataframe no longer had the reviewer IDs.

sentiment_reviewscores_reviewers <- shortdata_sentiment_reviewscores |> 
  left_join(shortdata_tokens |> select(reviewID, reviewerID), by = "reviewID")

sentiment_reviewscores_reviewers

# A tibble: 515,463 × 4
   reviewID overall sentimentscore reviewerID    
      <int>   <int>          <dbl> <chr>         
 1        2       5           1.09 A3ECOW0TWLH9V6
 2        2       5           1.09 A3ECOW0TWLH9V6
 3        2       5           1.09 A3ECOW0TWLH9V6
 4        2       5           1.09 A3ECOW0TWLH9V6
 5        2       5           1.09 A3ECOW0TWLH9V6
 6        2       5           1.09 A3ECOW0TWLH9V6
 7        2       5           1.09 A3ECOW0TWLH9V6
 8        2       5           1.09 A3ECOW0TWLH9V6
 9        2       5           1.09 A3ECOW0TWLH9V6
10        2       5           1.09 A3ECOW0TWLH9V6
# ℹ 515,453 more rows

Now we can calculate the number of reviews, mean sentiment score, and mean rating for each reviewer.

reviewers <- sentiment_reviewscores_reviewers |> 
  group_by(reviewerID) |> 
  summarise(
    totalReviews = n(),
    meanSentiment = mean(sentimentscore),
    meanRating = mean(overall)
  ) |> 
  filter(totalReviews > 10)

reviewers |> 
  ggplot(aes(x = meanRating, y = meanSentiment)) +
  geom_point(color = "saddlebrown", alpha = 0.5) +
  labs(
    title = "Relationship between Reviewers' mean rating and mean sentiment")

This is a bit strange but also makes sense considering some of my observations noted in the previous part of the exercise - that sometimes positive sentiments appear in negative reviews, and vice versa, so there’s less ‘extreme’ on both ends of the spectru, and more ‘in between’. My initial observation is that there is a broader range of mean sentiment scores for the reviewers with a mean rating of 5. it seems like there’s a vaguely (perhaps) linear relationship between the meanSentiment and meanRating though I can’t say for certain if there would be other ways of getting to a more valid ‘mean’ like perhaps I needed to normalize like in previous steps. It is interesting that most sentiment scores tend to hover around the neutral-to-positive range, visible in the space between the meanRating of 4 and 5, and the meanSentiment above 0 but less than 2.5.

Problem 2: Words with high relative frequency

As your starting point, take the tokenized data frame that has been filtered to remove stop words, but hasn’t been joined with the sentiment lexicon data. For each item (asin), use the bind_tf_idf function to find the word that occurs in the reviews of that item with the highest frequency relative to the frequency of words in the entire review test dataset.

First, I will use the bind_tf_idf() function to compute the tf-idf (term frequency - inverse document frequency) scores for each word in each asin item.

shortdata_tfidf <- shortdata_tokens |> 
  count(asin, word, sort = TRUE) |> 
  bind_tf_idf(word, asin, n)

shortdata_tfidf

# A tibble: 277,163 × 6
   asin       word        n     tf   idf tf_idf
   <chr>      <chr>   <int>  <dbl> <dbl>  <dbl>
 1 B0001DUD9O yarn     1306 0.0806  2.31 0.186 
 2 B0001DUD9O color     559 0.0345  1.08 0.0373
 3 B000F7DPEQ machine   460 0.0791  2.10 0.166 
 4 B000B7Q9KM dye       451 0.0520  3.69 0.192 
 5 B000BY4Q5K yarn      405 0.0882  2.31 0.204 
 6 B000XE3FGO machine   398 0.0698  2.10 0.147 
 7 B0001DUD9O red       376 0.0232  2.37 0.0550
 8 B0001DUD9O heart     336 0.0207  3.44 0.0713
 9 B000980L02 pom       335 0.0288  5.14 0.148 
10 B000B7Q9KM color     327 0.0377  1.08 0.0408
# ℹ 277,153 more rows

Next we’ll find the word that occurs in the reviews of that item with the highest frequency relative to the frequency of words in the entire review test dataset.

topwords_asin <- shortdata_tfidf |> 
  group_by(asin) |> 
  slice_max(tf_idf, n = 1) |> 
  ungroup()

topwords_asin

# A tibble: 1,775 × 6
   asin       word        n     tf   idf tf_idf
   <chr>      <chr>   <int>  <dbl> <dbl>  <dbl>
 1 0449819906 book       21 0.0464  2.29  0.106
 2 048625531X mazes       2 0.04    7.45  0.298
 3 0486473082 book        7 0.156   2.29  0.356
 4 0715329278 book       13 0.0599  2.29  0.137
 5 0804844844 creases     2 0.0606  5.14  0.312
 6 0823013626 donna      10 0.0252  6.75  0.170
 7 0848724666 book       26 0.0942  2.29  0.216
 8 0848734270 recipes     7 0.111   5.37  0.596
 9 0887248845 jesus       3 0.0316  7.45  0.235
10 0982094825 book        4 0.133   2.29  0.305
# ℹ 1,765 more rows

Select five items from the dataset (either at random or by hand) and look up the asin code for those items on Amazon.com. In each of these cases, does the highest relative frequency word correspond to the identify or type of the item that you chose? You may not be able to find every single item, but I was able to find a solid majority of the ones I searched for by searching amazon.com for the asin.

Yes, these all seem to correspond to the identiy or type of the item: - asin 0804844844, word ‘creases’: is an Origami Paper - Cherry Blossom Patterns. This totally makes sense for the word ‘creases’ to be the word that’s most unique to this item’s reviews relative to the entire dataset. - asin 0823013626, word ‘donna’: this book about polymer clay is written by a person named Donna Kato. So this also makes a ton of sense. - asin 048625531X, word ‘mazes’: this is a book caled ‘Easy Mazes Activity Book’ so once again, yes, totally makes sense. - asin 0887248845, word ‘jesus’: this is a set of ‘dazzle stickers crosses’ in the shape of religious crosses that absolutely make sense to have a connection like this, totally makes sense. - asin 1601409788, word ‘handiwork’: this is a book called ‘Leisure Arts Crochet: pocket guide’ – like those I selected previously, yes, this totally makes sense as corresponding to the item.

Problem 3: Bigrams and Sentiment

Consider the two negative words not and don’t. Starting from the original dataset, tokenize the data into bigrams. Then calculate the frequency of bigrams that start with either not or don’t. What are the 10 most common words occurring after not and after don’t? What are their sentiment values according to the afinn lexicon?

bigrams <- shortdata  |> 
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2)

OK, now that we’ve created bigrams, we’ll filter them to keep only those that start with “not” or “don’t”. To do this we will separate each bigram into its first and second word (so we see what comes after “not” and “don’t”) and then filter.

bigrams_first2words <- bigrams |> 
  separate(bigram, into = c("word1", "word2"), sep = " ")

notdont_bigrams <- bigrams_first2words |> 
  filter(word1 %in% c("not", "don't"))

notdont_bigrams

# A tibble: 13,953 × 7
   reviewID overall verified reviewerID     asin       word1 word2    
      <int>   <int> <lgl>    <chr>          <chr>      <chr> <chr>    
 1        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 not   frustrate
 2        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 don't want     
 3        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 not   a        
 4        5       5 TRUE     A2A6MZ2QB4AE0L 0449819906 not   too      
 5        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   a        
 6        6       5 TRUE     A2OIU84NPVKIWN 0449819906 don't make     
 7        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   the      
 8        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   be       
 9        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   over     
10        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   found    
# ℹ 13,943 more rows

Since it looks like already we see a lot of the stopwords here in word2, we’ll remove them with an anti join like we did in steps above.

notdont_bigrams_nostopwords <- notdont_bigrams |> 
  anti_join(stop_words, by = c("word2" = "word")) 

notdont_bigrams_nostopwords

# A tibble: 4,873 × 7
   reviewID overall verified reviewerID     asin       word1 word2    
      <int>   <int> <lgl>    <chr>          <chr>      <chr> <chr>    
 1        2       5 TRUE     A3ECOW0TWLH9V6 0449819906 not   frustrate
 2        6       5 TRUE     A2OIU84NPVKIWN 0449819906 not   found    
 3       16       2 TRUE     AC0J24YW9X2FO  0486473082 not   easy     
 4       26       5 TRUE     A3L0W8SH1FW16P 0715329278 don't bore     
 5       31       3 TRUE     A3JPKWIYSWKADO 0804844844 don't intend   
 6       31       3 TRUE     A3JPKWIYSWKADO 0804844844 don't sharpen  
 7       63       1 FALSE    A2O6SU5YDVSDN4 0848724666 not   donate   
 8      115       5 TRUE     A1K1B5YB9GR3VH 0988174979 not   liking   
 9      122       5 FALSE    A2ZYE67M5YI6IW 0988174979 not   regret   
10      151       5 TRUE     A3UPL81MZO2CJR 1571208216 not   wrinkle  
# ℹ 4,863 more rows

This looks much better, so we’ll move on to identify the most common words (other than stop words) after “not” and “don’t”.

notdont_bigrams_mostcommon <- notdont_bigrams_nostopwords |> 
  count(word1, word2, sort = TRUE) |> 
  group_by(word1) |> 
  slice_max(n, n = 10) |> 
  ungroup()

notdont_bigrams_mostcommon

# A tibble: 20 × 3
   word1 word2          n
   <chr> <chr>      <int>
 1 don't buy           45
 2 don't waste         45
 3 don't mind          38
 4 don't care          35
 5 don't expect        34
 6 don't feel          27
 7 don't leave         22
 8 don't lose          22
 9 don't hold          21
10 don't understand    21
11 not   recommend    103
12 not   buy           93
13 not   worth         88
14 not   fit           65
15 not   cut           57
16 not   happy         41
17 not   hold          41
18 not   bad           37
19 not   easy          37
20 not   stay          37

Now to find the sentiment of each word2 according to the afinn lexicon we will do the following to get sentiment values for individual words.

notdont_bigrams_mostcommon_sentiments <- notdont_bigrams_mostcommon |> 
  left_join(afinn, by = c("word2" = "word")) |> 
  select(word1, word2, n, value)

notdont_bigrams_mostcommon_sentiments

# A tibble: 20 × 4
   word1 word2          n value
   <chr> <chr>      <int> <dbl>
 1 don't buy           45    NA
 2 don't waste         45    -1
 3 don't mind          38    NA
 4 don't care          35     2
 5 don't expect        34    NA
 6 don't feel          27    NA
 7 don't leave         22    -1
 8 don't lose          22    NA
 9 don't hold          21    NA
10 don't understand    21    NA
11 not   recommend    103     2
12 not   buy           93    NA
13 not   worth         88     2
14 not   fit           65     1
15 not   cut           57    -1
16 not   happy         41     3
17 not   hold          41    NA
18 not   bad           37    -3
19 not   easy          37     1
20 not   stay          37    NA

Strangely, it looks like there is no value associated with a number of the most common words after not and don’t. I am not sure why someting like “expect” or “understand” would be missing from afinn. In any case, among those that do have a value that’s got a sentiment score in afinn, for “don’t” we see two “-1” and one “2”, so nothing exceedingly negative but definitely leaning negative. We see more values associated (but not for all of the top 10) for the words that come after “not”, with the most negative being “bad” (score of -3) and the most positive being “happy” (score of 3), which interestingly make the phrases “not bad” and “not happy”, very very different sentiments on the whole. It’s interesting to observe how different breaking things into individual words changes the ‘sentiment score’ potentially, like the most common word after “not” is “recommend” and while “recommend” has a sentiment score of 2, the phrase “not recommend” as in “I would not recommend this item” is not a positive statement. compare that to “not bad” which isn’t as negative (in my opinion) as “not recommend” or “not happy” or “not worth” – another which has a word2 with a relatively positive value but the phrase “not worth” (as in “not worth the money” or something) does not. Super interesting to see how this all plays into sentiment analysis. Words are not islands!

Pick the most commonly occurring bigram where the first word is not or don’t and the afinn sentiment of the second word is 2 or greater. Compute the mean rating of the reviews containing this bigram. How do they compare the average review score over the entire dataset?

notdont_bigrams_select <- notdont_bigrams |> 
  filter((word1 == "not" & word2 == "recommend") |
           (word1 == "don't" & word2 == "care"))

notdont_bigrams_select

# A tibble: 138 × 7
   reviewID overall verified reviewerID     asin       word1 word2    
      <int>   <int> <lgl>    <chr>          <chr>      <chr> <chr>    
 1      208       4 TRUE     A2T639IINN6BTG 1584799684 not   recommend
 2      512       1 TRUE     A1IS1EIF6H773T B0000302YM not   recommend
 3      604       1 TRUE     A1C8ZZ1C413OGO B00004T7R3 not   recommend
 4      961       1 FALSE    A3M9TJZUV6747L B00004THXH not   recommend
 5      976       1 FALSE    A3M9TJZUV6747L B00004THXH not   recommend
 6      977       1 FALSE    A3M9TJZUV6747L B00004THXH not   recommend
 7     1532       4 FALSE    AHRNDSKWH1YBT  B00006B8FS don't care     
 8     2053       4 TRUE     AI43VKPN5NF7D  B00006IFGJ don't care     
 9     2331       1 TRUE     A3N5V4QQBZJ21D B00006IFFX not   recommend
10     2332       1 TRUE     A3N5V4QQBZJ21D B00006IFFX not   recommend
# ℹ 128 more rows

reviewswith_notdont_bigrams_select <- notdont_bigrams_select |> 
  inner_join(shortdata, by = "reviewID")

reviewswith_notdont_bigrams_select

# A tibble: 138 × 12
   reviewID overall.x verified.x reviewerID.x   asin.x     word1 word2 overall.y
      <int>     <int> <lgl>      <chr>          <chr>      <chr> <chr>     <int>
 1      208         4 TRUE       A2T639IINN6BTG 1584799684 not   reco…         4
 2      512         1 TRUE       A1IS1EIF6H773T B0000302YM not   reco…         1
 3      604         1 TRUE       A1C8ZZ1C413OGO B00004T7R3 not   reco…         1
 4      961         1 FALSE      A3M9TJZUV6747L B00004THXH not   reco…         1
 5      976         1 FALSE      A3M9TJZUV6747L B00004THXH not   reco…         1
 6      977         1 FALSE      A3M9TJZUV6747L B00004THXH not   reco…         1
 7     1532         4 FALSE      AHRNDSKWH1YBT  B00006B8FS don't care          4
 8     2053         4 TRUE       AI43VKPN5NF7D  B00006IFGJ don't care          4
 9     2331         1 TRUE       A3N5V4QQBZJ21D B00006IFFX not   reco…         1
10     2332         1 TRUE       A3N5V4QQBZJ21D B00006IFFX not   reco…         1
# ℹ 128 more rows
# ℹ 4 more variables: verified.y <lgl>, reviewerID.y <chr>, asin.y <chr>,
#   reviewText <chr>

The inner join led to some repeated columns but since it’s not the ‘final’ form of a dataset I am utilizing I will leave it as-is.

meanratings_bigrams_select <- reviewswith_notdont_bigrams_select |> 
  group_by(word1, word2) |> 
  summarise(meanRating = mean(overall.x, na.rm = TRUE)) |> 
  ungroup()

`summarise()` has grouped output by 'word1'. You can override using the
`.groups` argument.

meanratings_bigrams_select

# A tibble: 2 × 3
  word1 word2     meanRating
  <chr> <chr>          <dbl>
1 don't care            3.34
2 not   recommend       2.40

Now we want to compare the above to overall mean ratings.

overall_meanratings <- shortdata |> 
  summarise(meanRating = mean(overall, na.rm = TRUE))

overall_meanratings

# A tibble: 1 × 1
  meanRating
       <dbl>
1       4.57

Interestingly we see that the overall mean rating of the dataset is 4.57 which is much higher than the overall mean rating for the reviews that start with “don’t care” or “not recommend”. I am not surprised that “not recommend” has a lower mean rating than “don’t care” because “not recommend” is more consistently (I think) a phrase that has a negative sentiment, other than something along the sentiment of “I can’t recommend this enough!”. But something like “don’t care” is more ambiguous - “don’t care what other reviewers say, this is AMAZING!”. Though I think it completely fits with our understanding that any reviews that start with “don’t care” or “not recommend” are associated with lower overall reviews compared with the mean rating overall across the entire dataset. Very interesting!