Problem Set 5 Template

# Remember to install packages first
library(tm)

## Loading required package: NLP

library(SnowballC)
library(wordcloud)

## Loading required package: RColorBrewer

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(syuzhet)
library(lda)
library(ldatuning)
library(topicmodels)

For this problem set you will be working with the Amazon Fine Foods Reviews data (https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews). We will limit our analysis to just the text of the reviews and the score given by the user (i.e., number of stars). Perform the following four steps to conduct text analysis using the packages discussed in class.

Part 1: Data Cleaning & Preprocessing

Load the data and select the columns: HelpfulnessNumerator, HelpfulnessDenominator, Score, and Text.

# Remember to set working directory, then use read.csv()
# Load the dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

reviews <- read_csv("ReviewsData.csv")

## Rows: 2739 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ProductId, UserId, ProfileName, Summary, Text
## dbl (5): Id, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Select only the required columns
reviews_clean <- reviews %>%
  select(HelpfulnessNumerator, HelpfulnessDenominator, Score, Text)

# Preview the data
glimpse(reviews_clean)

## Rows: 2,739
## Columns: 4
## $ HelpfulnessNumerator   <dbl> 844, 866, 808, 580, 491, 559, 538, 536, 524, 48…
## $ HelpfulnessDenominator <dbl> 923, 878, 815, 593, 569, 562, 544, 539, 536, 49…
## $ Score                  <dbl> 3, 5, 5, 1, 3, 5, 5, 5, 2, 5, 5, 1, 5, 4, 1, 5,…
## $ Text                   <chr> "I ordered one of these Fresh \"Whole\" Rabbits…

Create a variable Helpfulness Percentage by dividing the Numerator by the Denominator column.

# Create Helpfulness Percentage variable
reviews_clean <- reviews_clean %>%
  mutate(HelpfulnessPercent = if_else(HelpfulnessDenominator == 0,
                                      NA_real_,
                                      HelpfulnessNumerator / HelpfulnessDenominator))

# Preview the updated data
glimpse(reviews_clean)

## Rows: 2,739
## Columns: 5
## $ HelpfulnessNumerator   <dbl> 844, 866, 808, 580, 491, 559, 538, 536, 524, 48…
## $ HelpfulnessDenominator <dbl> 923, 878, 815, 593, 569, 562, 544, 539, 536, 49…
## $ Score                  <dbl> 3, 5, 5, 1, 3, 5, 5, 5, 2, 5, 5, 1, 5, 4, 1, 5,…
## $ Text                   <chr> "I ordered one of these Fresh \"Whole\" Rabbits…
## $ HelpfulnessPercent     <dbl> 0.9144095, 0.9863326, 0.9914110, 0.9780776, 0.8…
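The `if_else()` guard in the chunk above matters because R does not raise an error on division by zero. A minimal sketch on made-up vote counts (the `votes` data frame is hypothetical, not taken from the dataset) shows what the unguarded division would return next to the guarded version:

```r
library(dplyr)

# Hypothetical vote counts, not taken from ReviewsData.csv
votes <- data.frame(
  HelpfulnessNumerator   = c(9, 0, 3),
  HelpfulnessDenominator = c(10, 0, 4)
)

votes <- votes %>%
  mutate(
    # Unguarded division: 0 / 0 yields NaN rather than an error
    raw_ratio = HelpfulnessNumerator / HelpfulnessDenominator,
    # Guarded division, as in the chunk above: reviews with no votes become NA
    HelpfulnessPercent = if_else(HelpfulnessDenominator == 0,
                                 NA_real_,
                                 HelpfulnessNumerator / HelpfulnessDenominator)
  )

print(votes)
```

Either way, rows without votes can later be dropped with `filter(!is.na(HelpfulnessPercent))`; the explicit `NA` simply makes the intent visible.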

Remove stop words and punctuation; convert to lowercase.

library(tidyverse)
library(tidytext)
library(textclean)
# Load built-in list of stop words
data("stop_words")
reviews_clean <- reviews_clean %>%
  mutate(Text = str_replace_all(Text, "<.*?>", " "))  # replace HTML tags like <br>, <p> with a space so neighboring words are not fused


# Tokenize, remove punctuation, convert to lowercase, remove stop words
reviews_tokens <- reviews_clean %>%
  select(Score, Text) %>%
  unnest_tokens(word, Text) %>%             # converts text to lowercase & tokenizes
  filter(!word %in% stop_words$word) %>%    # remove stop words
  filter(!str_detect(word, "^[0-9]+$"))     # remove numbers (optional)

# Preview cleaned tokens
head(reviews_tokens)

## # A tibble: 6 × 2
##   Score word   
##   <dbl> <chr>  
## 1     3 fresh  
## 2     3 rabbits
## 3     3 arrived
## 4     3 head   
## 5     3 fur    
## 6     3 insides
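To see what each step of the pipeline does, the sketch below runs the same calls on a single made-up review and then lists stop words that also appear in the Bing lexicon. The review text is hypothetical. Two points here are assumptions to check rather than facts about the assignment: the tag is replaced with a space so that words on either side of it are not glued into one token, and the overlap check is only a suggested diagnostic, since removing stop words before a lexicon join may drop some sentiment-bearing words.

```r
library(dplyr)
library(stringr)
library(tidytext)

data("stop_words")

# Hypothetical review text, not from the dataset
toy <- data.frame(Score = 5,
                  Text  = "The coffee arrived in 2 days.<br />Flavor was AMAZING!")

toy_tokens <- toy %>%
  mutate(Text = str_replace_all(Text, "<.*?>", " ")) %>%  # strip HTML tags
  unnest_tokens(word, Text) %>%                           # lowercase, drop punctuation
  filter(!word %in% stop_words$word) %>%                  # drop stop words
  filter(!str_detect(word, "^[0-9]+$"))                   # drop pure numbers

print(toy_tokens)

# Diagnostic: stop words that are also scored by the Bing lexicon
overlap <- intersect(stop_words$word, get_sentiments("bing")$word)
print(head(overlap, 20))
```

If the overlap turns out to be large, one option is to skip stop-word removal for the sentiment step only and keep it for the word clouds and topic model.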

Create a word cloud for the most positive (5 star) and most negative (1 star) reviews.

library(tidyverse)
library(tidytext)
library(wordcloud)
library(RColorBrewer)

data("stop_words")

# 1-star reviews
freq_1star <- reviews_clean %>%
  filter(Score == 1) %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word, !str_detect(word, "^[0-9]+$")) %>%
  count(word, sort = TRUE)

# 5-star reviews
freq_5star <- reviews_clean %>%
  filter(Score == 5) %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word, !str_detect(word, "^[0-9]+$")) %>%
  count(word, sort = TRUE)

# Word cloud for 1-star reviews
wordcloud(words = freq_1star$word,
          freq = freq_1star$n,
          max.words = 80,
          colors = brewer.pal(8, "Reds"))

# Word cloud for 5-star reviews
wordcloud(words = freq_5star$word,
          freq = freq_5star$n,
          max.words = 80,
          colors = brewer.pal(8, "Greens"))

## Warning in wordcloud(words = freq_5star$word, freq = freq_5star$n, max.words =
## 80, : coffee could not be fit on page. It will not be plotted.
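The warning above means the largest word did not fit in the plotting area at the default font scale. One possible remedy, sketched here on a hypothetical frequency table, is to lower the upper bound of `scale` and fix the random seed so the layout is reproducible. The specific values are assumptions to tune, not settings required by the assignment.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical word frequencies, not from the dataset
freq_demo <- data.frame(
  word = c("coffee", "flavor", "taste", "love", "cup", "price"),
  n    = c(120, 60, 55, 40, 30, 20)
)

set.seed(1234)  # wordcloud() places words randomly, so fix the seed
wordcloud(words = freq_demo$word,
          freq = freq_demo$n,
          scale = c(3, 0.5),      # default is c(4, 0.5); a smaller maximum helps large words fit
          min.freq = 1,
          max.words = 80,
          random.order = FALSE,   # place the most frequent words first, near the center
          colors = brewer.pal(8, "Greens"))
```

Enlarging the plotting device (for example through the chunk's figure size options) is another way to address the same warning.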

Part 2: Sentiment Analysis

Calculate sentiment scores for each review using the get_sentiment() function with the bing method.

# Add a ReviewID BEFORE tokenizing
reviews_clean <- reviews_clean %>%
  mutate(ReviewID = row_number())

# Tokenize and clean
reviews_tokens <- reviews_clean %>%
  select(ReviewID, Score, Text) %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word,
         !str_detect(word, "^[0-9]+$"))

# Sentiment using Bing
reviews_sentiment <- reviews_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(sentiment_value = if_else(sentiment == "positive", 1, -1)) %>%
  group_by(ReviewID) %>%
  summarise(sentiment_score = sum(sentiment_value), .groups = "drop")

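# Merge back with original data to see Score and Text.
# Reviews with no Bing lexicon words have no row in reviews_sentiment,
# so give them a neutral score of 0 instead of NA.
reviews_with_sentiment <- reviews_clean %>%
  left_join(reviews_sentiment, by = "ReviewID") %>%
  mutate(sentiment_score = replace_na(sentiment_score, 0))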

# Print first few rows
print(head(reviews_with_sentiment))

## # A tibble: 6 × 7
##   HelpfulnessNumerator HelpfulnessDenominator Score Text      HelpfulnessPercent
##                  <dbl>                  <dbl> <dbl> <chr>                  <dbl>
## 1                  844                    923     3 "I order…              0.914
## 2                  866                    878     5 "see upd…              0.986
## 3                  808                    815     5 "I purch…              0.991
## 4                  580                    593     1 "This pr…              0.978
## 5                  491                    569     3 "Coconut…              0.863
## 6                  559                    562     5 "This Ec…              0.995
## # ℹ 2 more variables: ReviewID <int>, sentiment_score <dbl>
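The prompt names `get_sentiment()`, which comes from the syuzhet package loaded at the top, whereas the chunk above scores words through tidytext's `get_sentiments("bing")` join. A minimal sketch of the syuzhet route, on two made-up review strings, could look like the following. syuzhet ships its own copy of the Bing lexicon, so its scores may not match the tidytext join exactly, especially after stop-word removal.

```r
library(syuzhet)

# Hypothetical review texts, not from the dataset
demo_text <- c("I love this coffee, it tastes wonderful.",
               "Terrible product, the box arrived broken and stale.")

# One score per element: matched positive words count +1, negative words -1
demo_scores <- get_sentiment(demo_text, method = "bing")
print(demo_scores)
```

On the real data, `get_sentiment(reviews_clean$Text, method = "bing")` would return a numeric vector aligned with the rows of `reviews_clean`, which can be added with `mutate()`.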

Make a plot of the top emotions for poor reviews (1 and 2 stars) and positive reviews (4 and 5 stars).

library(tidyverse)
library(tidytext)

# Filter only poor (1-2 stars) and positive (4-5 stars) reviews
reviews_emotion <- reviews_clean %>%
  filter(Score %in% c(1, 2, 4, 5)) %>%
  select(ReviewID, Score, Text) %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word,
         !str_detect(word, "^[0-9]+$"))

# Join with NRC lexicon
nrc <- get_sentiments("nrc") %>%
  filter(!sentiment %in% c("positive", "negative"))  # Keep only emotions

reviews_emotion_sentiment <- reviews_emotion %>%
  inner_join(nrc, by = "word")

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 17 of `x` matches multiple rows in `y`.
## ℹ Row 3247 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

# Group and count emotions by review type
emotion_counts <- reviews_emotion_sentiment %>%
  mutate(ReviewType = case_when(
    Score %in% c(1, 2) ~ "Poor (1-2 stars)",
    Score %in% c(4, 5) ~ "Positive (4-5 stars)"
  )) %>%
  count(ReviewType, sentiment) %>%
  group_by(ReviewType) %>%
  slice_max(order_by = n, n = 6)  # Top 6 emotions per group (ties kept); replaces the superseded top_n()

# Plot
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = ReviewType)) +
  geom_col(show.legend = TRUE, position = "dodge") +
  facet_wrap(~ ReviewType, scales = "free_y") +
  coord_flip() +
  labs(title ="Top Emotions in Poor vs Positive Reviews",
       x = "Emotion",
       y = "Word Count") +
  theme_minimal()
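The many-to-many warning printed by `inner_join()` above is expected with NRC: one word can carry several emotions, and the same word occurs in many reviews. The sketch below reproduces the situation with a tiny made-up lexicon (`toy_lexicon` is hypothetical, standing in for NRC) and declares the relationship explicitly, which is the option the warning itself suggests. It assumes dplyr 1.1.0 or later, where the `relationship` argument is available.

```r
library(dplyr)

# Hypothetical tokens: "sweet" appears in two reviews
toy_tokens <- data.frame(ReviewID = c(1, 1, 2),
                         word = c("sweet", "awful", "sweet"))

# Hypothetical lexicon: "sweet" maps to two emotions
toy_lexicon <- data.frame(word = c("sweet", "sweet", "awful"),
                          sentiment = c("joy", "trust", "disgust"))

# Each token is repeated once per emotion it carries
toy_joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

print(toy_joined)
```

Because each token row is duplicated per emotion, the plotted totals count word-emotion pairs rather than distinct words.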

Make a plot of the top emotions for unhelpful reviews (under 25%) and for helpful reviews (over 75%).

# Load NRC emotions
nrc <- get_sentiments("nrc") %>%
  filter(!sentiment %in% c("positive", "negative"))  # Keep only emotions

# Load your data
reviews <- read_csv("ReviewsData.csv")

## Rows: 2739 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ProductId, UserId, ProfileName, Summary, Text
## dbl (5): Id, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Clean + create HelpfulnessPercent + add ReviewID
reviews_clean <- reviews %>%
  select(HelpfulnessNumerator, HelpfulnessDenominator, Score, Text) %>%
  mutate(
    HelpfulnessPercent = if_else(HelpfulnessDenominator == 0, NA_real_,
                                 HelpfulnessNumerator / HelpfulnessDenominator),
    ReviewID = row_number(),
    Text = str_replace_all(Text, "<.*?>", " ")  # replace HTML tags with a space so neighboring words are not fused
  )

# Filter for helpfulness extremes
reviews_helpfulness <- reviews_clean %>%
  filter(!is.na(HelpfulnessPercent)) %>%
  mutate(HelpfulnessGroup = case_when(
    HelpfulnessPercent < 0.25 ~ "Unhelpful (<25%)",
    HelpfulnessPercent > 0.75 ~ "Helpful (>75%)",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(HelpfulnessGroup))  # remove middle group

# Tokenize and clean
data("stop_words")
reviews_tokens <- reviews_helpfulness %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word,
         !str_detect(word, "^[0-9]+$"))

# Join with NRC emotions
reviews_emotions <- reviews_tokens %>%
  inner_join(nrc, by = "word")

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 7 of `x` matches multiple rows in `y`.
## ℹ Row 4888 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

# Count top emotions by helpfulness group
emotion_counts <- reviews_emotions %>%
  count(HelpfulnessGroup, sentiment) %>%
  group_by(HelpfulnessGroup) %>%
  slice_max(order_by = n, n = 6)  # Top 6 per group (ties kept); replaces the superseded top_n()

# Plot
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = HelpfulnessGroup)) +
  geom_col(show.legend = TRUE, position = "dodge") +
  facet_wrap(~ HelpfulnessGroup, scales = "free_y") +
  coord_flip() +
  labs(title = "Top Emotions in Helpful vs Unhelpful Reviews",
       x = "Emotion",
       y = "Word Count") +
  theme_minimal()
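Because the two helpfulness groups can contain very different numbers of reviews, raw word counts are hard to compare across facets. A possible refinement, sketched here on hypothetical counts (the numbers in `emotion_demo` are invented), is to convert counts to within-group shares before plotting.

```r
library(dplyr)

# Hypothetical emotion counts per group
emotion_demo <- data.frame(
  HelpfulnessGroup = rep(c("Helpful (>75%)", "Unhelpful (<25%)"), each = 3),
  sentiment = rep(c("trust", "joy", "fear"), times = 2),
  n = c(600, 300, 100, 30, 10, 10)
)

emotion_share <- emotion_demo %>%
  group_by(HelpfulnessGroup) %>%
  mutate(share = n / sum(n)) %>%   # share of emotion words within each group
  ungroup()

print(emotion_share)
```

Plotting `share` instead of `n` on the y-axis would put both groups on a common scale, at which point `scales = "free_y"` would no longer be needed.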

Part 3: Topic Modeling (LDA)

Create a Document-Term Matrix.

reviews_tokens <- reviews_clean %>%
  unnest_tokens(word, Text) %>%
  filter(!word %in% stop_words$word,
         !str_detect(word, "^[0-9]+$")) %>%
  count(ReviewID, word, sort = TRUE)

# Create Document-Term Matrix
dtm <- reviews_tokens %>%
  cast_dtm(document = ReviewID, term = word, value = n)

# Check the result
dtm

## <<DocumentTermMatrix (documents: 2739, terms: 22708)>>
## Non-/sparse entries: 132710/62064502
## Sparsity           : 100%
## Maximal term length: 39
## Weighting          : term frequency (tf)
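The "Sparsity: 100%" line is a rounded figure. The arithmetic behind it can be checked directly from the numbers in the printout, as in this short sketch (variable names are illustrative only):

```r
# Figures copied from the DocumentTermMatrix printout above
n_docs      <- 2739
n_terms     <- 22708
n_nonsparse <- 132710
n_sparse    <- 62064502

# Every document-term pair is one cell of the matrix
total_cells <- n_docs * n_terms
stopifnot(total_cells == n_nonsparse + n_sparse)

# Fraction of cells that are zero
sparsity <- n_sparse / total_cells
print(round(sparsity, 4))
```

The result is roughly 99.8%, which the printout rounds to 100%. If the matrix is too large to work with, `removeSparseTerms()` from tm is one option for dropping very rare terms before fitting the model; the threshold to use would be a judgment call.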

Apply LDA with k = 5 topics.

library(topicmodels)

# Apply LDA with k = 5 topics
lda_model <- LDA(dtm, k = 5, control = list(seed = 1234))

# Check the model
lda_model

## A LDA_VEM topic model with 5 topics.

Show the top 10 terms for each topic.

# Show the top 10 terms for each topic
top_terms <- terms(lda_model, 10)

# Print top terms for each topic
for (i in seq_len(ncol(top_terms))) {
  cat("\nTopic", i, "Top Terms:\n")
  print(top_terms[, i])
}

## 
## Topic 1 Top Terms:
##  [1] "food"    "taste"   "dog"     "dogs"    "product" "noodles" "cat"    
##  [8] "water"   "flavor"  "eat"    
## 
## Topic 2 Top Terms:
##  [1] "coffee"  "cup"     "box"     "cups"    "product" "amazon"  "pods"   
##  [8] "flavor"  "keurig"  "time"   
## 
## Topic 3 Top Terms:
##  [1] "sugar"     "oil"       "chocolate" "product"   "taste"     "free"     
##  [7] "flavor"    "mix"       "fat"       "coconut"  
## 
## Topic 4 Top Terms:
##  [1] "milk"    "water"   "product" "organic" "oil"     "honey"   "coconut"
##  [8] "time"    "food"    "taste"  
## 
## Topic 5 Top Terms:
##  [1] "tea"     "product" "amazon"  "green"   "box"     "taste"   "time"   
##  [8] "bag"     "price"   "buy"
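Beyond the ranked term lists, the fitted model also holds per-topic word probabilities and per-document topic proportions, which can help when interpreting topics. The sketch below fits a two-topic model to a tiny made-up corpus (the documents and the choice of `k = 2` are hypothetical) and extracts both with `posterior()` from topicmodels.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical mini-corpus, not from the dataset
toy_docs <- data.frame(
  ReviewID = 1:4,
  Text = c("coffee cup coffee pods keurig coffee",
           "tea green tea bag tea leaves",
           "coffee pods cup strong coffee",
           "green tea bag tea box")
)

# Same steps as above: tokenize, count words per document, cast to a DTM
toy_dtm <- toy_docs %>%
  unnest_tokens(word, Text) %>%
  count(ReviewID, word) %>%
  cast_dtm(document = ReviewID, term = word, value = n)

toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

post <- posterior(toy_lda)
print(round(post$terms, 3))   # topics x terms: word probabilities per topic
print(round(post$topics, 3))  # documents x topics: topic proportions per review
print(topics(toy_lda))        # most likely topic for each document
```

Applied to `lda_model`, the `topics` matrix could be joined back to `reviews_clean` by `ReviewID` to read a few reviews that load heavily on each topic before naming it.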

Interpret what each topic might represent (e.g., complaints about packaging, praise for taste, etc.).

Answer:

Part 4: Reflection Questions

How do sentiment scores align with the star ratings?

The comparison here rests on the NRC emotion counts for poor (1-2 star) versus positive (4-5 star) reviews rather than on the per-review Bing scores. The emotion profiles align with the star ratings: positive reviews show higher counts for emotions like “joy” and “anticipation,” while poor reviews show higher counts for “sadness,” “fear,” and “disgust.”
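A more direct check of alignment would summarize the per-review Bing score by star rating. The sketch below does this on hypothetical numbers shaped like `reviews_with_sentiment` (the values in `demo` are invented for illustration and are not results from the dataset):

```r
library(dplyr)

# Hypothetical per-review results shaped like reviews_with_sentiment
demo <- data.frame(
  Score = c(1, 1, 2, 3, 4, 5, 5, 5),
  sentiment_score = c(-3, -1, -2, 0, 2, 4, 1, 3)
)

# Average sentiment and number of reviews at each star level
by_score <- demo %>%
  group_by(Score) %>%
  summarise(mean_sentiment = mean(sentiment_score, na.rm = TRUE),
            n_reviews = n(),
            .groups = "drop")
print(by_score)

# Rank correlation between stars and sentiment (Spearman suits ordinal ratings)
rho <- cor(demo$Score, demo$sentiment_score,
           method = "spearman", use = "complete.obs")
print(rho)
```

A mean score that rises with the star rating, together with a clearly positive correlation, would support the claim that sentiment tracks ratings.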

How do sentiment scores align with helpfulness?

The analysis plots the top emotions in reviews categorized as “Unhelpful (<25%)” versus “Helpful (>75%)” based on the HelpfulnessPercent variable.
The top 6 emotions (such as trust, joy, anticipation, sadness, fear, and surprise) do not differ sharply between helpful and unhelpful reviews: similar emotions are prominent in both groups, although their counts vary.

What are the most common topics found in the reviews?

The LDA topic modeling identified 5 topics. Based on the top terms for each topic, the common themes are:

Topic 1: Pet food (dog, cat) and general food items/taste.
Topic 2: Coffee and related products (cup, pods, Keurig), packaging/ordering (box, Amazon).
Topic 3: Ingredients, health aspects (sugar, oil, chocolate, free, fat, coconut), taste/flavor.
Topic 4: Natural/organic products, liquids (milk, water, honey, coconut oil), general food terms.
Topic 5: Tea products, purchasing context (Amazon, price, buy, bag, box).

Overall, common topics revolve around specific products (coffee, tea, pet food), ingredients, taste/flavor, health attributes, and the purchasing experience (packaging, price, Amazon).

Were there any surprises in the sentiment or topics?

Sentiment: “Trust” appears as a top emotion in both positive (4-5 star) and poor (1-2 star) reviews, which is somewhat surprising and suggests that trust-related words are common regardless of rating. The similarity in the top emotions across the helpfulness comparison plot is also unexpected if one assumed that helpful and unhelpful reviews have very different emotional drivers.
Topics: The topics generally cover expected areas for food reviews. The clear separation of coffee (Topic 2) and tea (Topic 5) into distinct topics highlights their prominence in the dataset. The inclusion of pet food terms (Topic 1) alongside human food items is also noteworthy.