# Remember to install packages first
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(syuzhet)
library(lda)
library(ldatuning)
library(topicmodels)
For this problem set you will be working with the Amazon Fine Foods Reviews data (https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews). We will limit our analysis to just the text of the reviews and the score given by the user (i.e., number of stars). Perform the following four steps to conduct text analysis using the packages discussed in class.
##Part 1: Data Cleaning & Preprocessing
Load the data and select the columns: HelpfulnessNumerator, HelpfulnessDenominator, Score, and Text.
# Remember to set working directory, then use read.csv()
# Load the dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
reviews <- read_csv("ReviewsData.csv")
## Rows: 2739 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ProductId, UserId, ProfileName, Summary, Text
## dbl (5): Id, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Select only the required columns
reviews_clean <- reviews %>%
select(HelpfulnessNumerator, HelpfulnessDenominator, Score, Text)
# Preview the data
glimpse(reviews_clean)
## Rows: 2,739
## Columns: 4
## $ HelpfulnessNumerator <dbl> 844, 866, 808, 580, 491, 559, 538, 536, 524, 48…
## $ HelpfulnessDenominator <dbl> 923, 878, 815, 593, 569, 562, 544, 539, 536, 49…
## $ Score <dbl> 3, 5, 5, 1, 3, 5, 5, 5, 2, 5, 5, 1, 5, 4, 1, 5,…
## $ Text <chr> "I ordered one of these Fresh \"Whole\" Rabbits…
Create a variable Helpfulness Percentage by dividing the Numerator by the Denominator column.
# Create Helpfulness Percentage variable
reviews_clean <- reviews_clean %>%
mutate(HelpfulnessPercent = if_else(HelpfulnessDenominator == 0,
NA_real_,
HelpfulnessNumerator / HelpfulnessDenominator))
# Preview the updated data
glimpse(reviews_clean)
## Rows: 2,739
## Columns: 5
## $ HelpfulnessNumerator <dbl> 844, 866, 808, 580, 491, 559, 538, 536, 524, 48…
## $ HelpfulnessDenominator <dbl> 923, 878, 815, 593, 569, 562, 544, 539, 536, 49…
## $ Score <dbl> 3, 5, 5, 1, 3, 5, 5, 5, 2, 5, 5, 1, 5, 4, 1, 5,…
## $ Text <chr> "I ordered one of these Fresh \"Whole\" Rabbits…
## $ HelpfulnessPercent <dbl> 0.9144095, 0.9863326, 0.9914110, 0.9780776, 0.8…
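A minimal sketch of the same zero-denominator guard on a toy tibble. The values are made up for illustration and do not come from ReviewsData.csv. Without the guard, a review with no votes would give 0 / 0, which R evaluates to NaN.

```r
library(dplyr)

# Toy data (hypothetical values, not from ReviewsData.csv)
toy <- tibble(
  HelpfulnessNumerator   = c(3, 0, 0),
  HelpfulnessDenominator = c(4, 5, 0)
)

# Same rule as above: no votes means the percentage is missing, not NaN
toy <- toy %>%
  mutate(HelpfulnessPercent = if_else(HelpfulnessDenominator == 0,
                                      NA_real_,
                                      HelpfulnessNumerator / HelpfulnessDenominator))

toy$HelpfulnessPercent
```

The third row has no votes, so it is marked missing and can later be dropped with `filter(!is.na(HelpfulnessPercent))`.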
Remove stop words and punctuation; convert to lowercase.
library(tidyverse)
library(tidytext)
library(textclean)
# Load built-in list of stop words
data("stop_words")
reviews_clean <- reviews_clean %>%
mutate(Text = str_replace_all(Text, "<.*?>", "")) # remove HTML tags like <br>, <p>, etc.
# Tokenize, remove punctuation, convert to lowercase, remove stop words
reviews_tokens <- reviews_clean %>%
select(Score, Text) %>%
unnest_tokens(word, Text) %>% # converts text to lowercase & tokenizes
filter(!word %in% stop_words$word) %>% # remove stop words
filter(!str_detect(word, "^[0-9]+$")) # remove numbers (optional)
# Preview cleaned tokens
head(reviews_tokens)
## # A tibble: 6 × 2
## Score word
## <dbl> <chr>
## 1 3 fresh
## 2 3 rabbits
## 3 3 arrived
## 4 3 head
## 5 3 fur
## 6 3 insides
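One caveat with the HTML-stripping step: replacing a tag with an empty string can fuse the words on either side of it when no space separates them. The sketch below uses a made-up review to compare the empty-string replacement used above with a single-space replacement. The space variant is a suggested alternative, not what produced the output shown in this report.

```r
library(dplyr)
library(stringr)
library(tidytext)

data("stop_words")

# Hypothetical review text with a <br /> tag between two words
toy <- tibble(Score = 5, Text = "Great coffee<br />Rich flavor, bought 2 boxes!")

# Helper for illustration only: strip tags, tokenize, drop stop words and numbers
clean_tokens <- function(df, replacement) {
  df %>%
    mutate(Text = str_replace_all(Text, "<.*?>", replacement)) %>%
    unnest_tokens(word, Text) %>%            # lowercases and strips punctuation
    filter(!word %in% stop_words$word,
           !str_detect(word, "^[0-9]+$"))
}

fused  <- clean_tokens(toy, "")   # tag removed with nothing in its place
spaced <- clean_tokens(toy, " ")  # tag replaced by a space

fused$word
spaced$word
```

With the empty replacement the token `coffeerich` appears. With the space replacement `coffee` is kept as its own token.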
Create a word cloud for the most positive (5 star) and most negative (1 star) reviews.
library(tidyverse)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
data("stop_words")
# 1-star reviews
freq_1star <- reviews_clean %>%
filter(Score == 1) %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word, !str_detect(word, "^[0-9]+$")) %>%
count(word, sort = TRUE)
# 5-star reviews
freq_5star <- reviews_clean %>%
filter(Score == 5) %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word, !str_detect(word, "^[0-9]+$")) %>%
count(word, sort = TRUE)
# Word cloud for 1-star reviews
wordcloud(words = freq_1star$word,
freq = freq_1star$n,
max.words = 80,
colors = brewer.pal(8, "Reds"))
# Word cloud for 5-star reviews
wordcloud(words = freq_5star$word,
freq = freq_5star$n,
max.words = 80,
colors = brewer.pal(8, "Greens"))
## Warning in wordcloud(words = freq_5star$word, freq = freq_5star$n, max.words =
## 80, : coffee could not be fit on page. It will not be plotted.
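The warning above means the most frequent word, "coffee", was too large to place and was left out of the 5-star cloud. A possible workaround, sketched here on a made-up frequency table standing in for `freq_5star`, is to lower the maximum font size through the `scale` argument and to fix the random seed so the layout is repeatable. The specific `scale` values are an assumption and would need tuning for the real data and plot size.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical frequency table in the same shape as freq_5star
toy_freq <- data.frame(
  word = c("coffee", "flavor", "taste", "tea", "price"),
  n    = c(120, 60, 45, 30, 10)
)

set.seed(1234)  # word placement is random; a seed makes the layout repeatable
wordcloud(words = toy_freq$word,
          freq = toy_freq$n,
          scale = c(3, 0.5),     # default is c(4, 0.5); a smaller maximum helps large words fit
          max.words = 80,
          random.order = FALSE,  # place the most frequent words first, near the centre
          colors = brewer.pal(8, "Greens"))
```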
Calculate sentiment scores for each review using the get_sentiment() function with the bing method.
# Add a ReviewID BEFORE tokenizing
reviews_clean <- reviews_clean %>%
mutate(ReviewID = row_number())
# Tokenize and clean
reviews_tokens <- reviews_clean %>%
select(ReviewID, Score, Text) %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word,
!str_detect(word, "^[0-9]+$"))
# Sentiment using Bing
reviews_sentiment <- reviews_tokens %>%
inner_join(get_sentiments("bing"), by = "word") %>%
mutate(sentiment_value = if_else(sentiment == "positive", 1, -1)) %>%
group_by(ReviewID) %>%
summarise(sentiment_score = sum(sentiment_value), .groups = "drop")
# Merge back with original data to see Score and Text
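reviews_with_sentiment <- reviews_clean %>%
left_join(reviews_sentiment, by = "ReviewID") %>%
mutate(sentiment_score = coalesce(sentiment_score, 0)) # reviews with no Bing words score 0 instead of NA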
# Print first few rows
print(head(reviews_with_sentiment))
## # A tibble: 6 × 7
## HelpfulnessNumerator HelpfulnessDenominator Score Text HelpfulnessPercent
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 844 923 3 "I order… 0.914
## 2 866 878 5 "see upd… 0.986
## 3 808 815 5 "I purch… 0.991
## 4 580 593 1 "This pr… 0.978
## 5 491 569 3 "Coconut… 0.863
## 6 559 562 5 "This Ec… 0.995
## # ℹ 2 more variables: ReviewID <int>, sentiment_score <dbl>
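The prompt names `get_sentiment()`, which comes from the syuzhet package loaded at the top, whereas the code above scores reviews by joining tokens to the tidytext Bing lexicon. A minimal sketch of the syuzhet route on two made-up sentences follows. The two approaches tokenize differently, so their scores may not match exactly.

```r
library(syuzhet)

# Hypothetical review texts for illustration
toy_text <- c("I love this delicious coffee, it is wonderful.",
              "Terrible product, stale and disgusting.")

# One score per element: positive words add to the score, negative words subtract
toy_scores <- get_sentiment(toy_text, method = "bing")
toy_scores
```

On the real data the same call could be applied to the whole column, for example `get_sentiment(reviews_clean$Text, method = "bing")`.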
Make a plot of the top emotions for poor reviews (1 and 2 stars) and positive reviews (4 and 5 stars)
library(tidyverse)
library(tidytext)
# Filter only poor (1-2 stars) and positive (4-5 stars) reviews
reviews_emotion <- reviews_clean %>%
filter(Score %in% c(1, 2, 4, 5)) %>%
select(ReviewID, Score, Text) %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word,
!str_detect(word, "^[0-9]+$"))
# Join with NRC lexicon
nrc <- get_sentiments("nrc") %>%
filter(!sentiment %in% c("positive", "negative")) # Keep only emotions
# A word can carry several NRC emotions, so a many-to-many match is expected
reviews_emotion_sentiment <- reviews_emotion %>%
inner_join(nrc, by = "word", relationship = "many-to-many")
# Group and count emotions by review type
emotion_counts <- reviews_emotion_sentiment %>%
mutate(ReviewType = case_when(
Score %in% c(1, 2) ~ "Poor (1-2 stars)",
Score %in% c(4, 5) ~ "Positive (4-5 stars)"
)) %>%
count(ReviewType, sentiment) %>%
group_by(ReviewType) %>%
slice_max(n, n = 6) %>% # Top 6 emotions per group (ties kept)
ungroup()
# Plot
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = ReviewType)) +
geom_col(show.legend = TRUE, position = "dodge") +
facet_wrap(~ ReviewType, scales = "free_y") +
coord_flip() +
labs(title ="Top Emotions in Poor vs Positive Reviews",
x = "Emotion",
y = "Word Count") +
theme_minimal()
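The plot above uses raw word counts, which are hard to compare if one review group contains far more text than the other. A sketch of a within-group share is shown below on made-up counts in the same shape as `emotion_counts`. On the real data the share would ideally be computed before keeping only the top six emotions.

```r
library(dplyr)

# Hypothetical emotion counts (not taken from the review data)
toy_counts <- tibble(
  ReviewType = c("Poor (1-2 stars)", "Poor (1-2 stars)",
                 "Positive (4-5 stars)", "Positive (4-5 stars)"),
  sentiment  = c("trust", "sadness", "trust", "joy"),
  n          = c(30, 20, 300, 200)
)

# Share of each emotion within its own review group
toy_share <- toy_counts %>%
  group_by(ReviewType) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

toy_share
```

In this toy case "trust" has ten times the raw count in the positive group but the same share in both, which is the kind of difference raw counts can hide.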
Make a plot of the top emotions for unhelpful reviews (under 25%) and for helpful reviews (over 75%)
# Load NRC emotions
nrc <- get_sentiments("nrc") %>%
filter(!sentiment %in% c("positive", "negative")) # Keep only emotions
# Load your data
reviews <- read_csv("ReviewsData.csv")
## Rows: 2739 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ProductId, UserId, ProfileName, Summary, Text
## dbl (5): Id, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean + create HelpfulnessPercent + add ReviewID
reviews_clean <- reviews %>%
select(HelpfulnessNumerator, HelpfulnessDenominator, Score, Text) %>%
mutate(
HelpfulnessPercent = if_else(HelpfulnessDenominator == 0, NA_real_,
HelpfulnessNumerator / HelpfulnessDenominator),
ReviewID = row_number(),
Text = str_replace_all(Text, "<.*?>", "") # remove HTML tags
)
# Filter for helpfulness extremes
reviews_helpfulness <- reviews_clean %>%
filter(!is.na(HelpfulnessPercent)) %>%
mutate(HelpfulnessGroup = case_when(
HelpfulnessPercent < 0.25 ~ "Unhelpful (<25%)",
HelpfulnessPercent > 0.75 ~ "Helpful (>75%)",
TRUE ~ NA_character_
)) %>%
filter(!is.na(HelpfulnessGroup)) # remove middle group
# Tokenize and clean
data("stop_words")
reviews_tokens <- reviews_helpfulness %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word,
!str_detect(word, "^[0-9]+$"))
# Join with NRC emotions
# A word can carry several NRC emotions, so a many-to-many match is expected
reviews_emotions <- reviews_tokens %>%
inner_join(nrc, by = "word", relationship = "many-to-many")
# Count top emotions by helpfulness group
emotion_counts <- reviews_emotions %>%
count(HelpfulnessGroup, sentiment) %>%
group_by(HelpfulnessGroup) %>%
slice_max(n, n = 6) %>% # Top 6 per group (ties kept)
ungroup()
# Plot
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = HelpfulnessGroup)) +
geom_col(show.legend = TRUE, position = "dodge") +
facet_wrap(~ HelpfulnessGroup, scales = "free_y") +
coord_flip() +
labs(title = "Top Emotions in Helpful vs Unhelpful Reviews",
x = "Emotion",
y = "Word Count") +
theme_minimal()
##Part 3: Topic Modeling (LDA)
Create a Document-Term Matrix.
reviews_tokens <- reviews_clean %>%
unnest_tokens(word, Text) %>%
filter(!word %in% stop_words$word,
!str_detect(word, "^[0-9]+$")) %>%
count(ReviewID, word, sort = TRUE)
# Create Document-Term Matrix
dtm <- reviews_tokens %>%
cast_dtm(document = ReviewID, term = word, value = n)
# Check the result
dtm
## <<DocumentTermMatrix (documents: 2739, terms: 22708)>>
## Non-/sparse entries: 132710/62064502
## Sparsity : 100%
## Maximal term length: 39
## Weighting : term frequency (tf)
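The matrix above is almost entirely zeros, with 22,708 terms across 2,739 reviews. One optional way to shrink it before modeling is `removeSparseTerms()` from the tm package, sketched here on a made-up count table. The threshold of 0.5 suits only this toy example, and a value for the real data would have to be chosen by inspection.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical per-review word counts in the same shape as reviews_tokens
toy_counts <- tibble(
  ReviewID = c(1, 1, 2, 2, 3, 3),
  word     = c("coffee", "bitter", "coffee", "smooth", "coffee", "tea"),
  n        = c(2, 1, 1, 1, 1, 3)
)

toy_dtm <- toy_counts %>%
  cast_dtm(document = ReviewID, term = word, value = n)

# Drop terms that are absent from at least half of the documents
toy_dtm_small <- removeSparseTerms(toy_dtm, sparse = 0.5)

dim(toy_dtm)
Terms(toy_dtm_small)
```

Only "coffee" occurs in more than half of the toy reviews, so it is the only term that survives.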
Apply LDA with k = 5 topics.
library(topicmodels)
# Apply LDA with k = 5 topics
lda_model <- LDA(dtm, k = 5, control = list(seed = 1234))
# Check the model
lda_model
## A LDA_VEM topic model with 5 topics.
Show the top 10 terms for each topic.
# Show the top 10 terms for each topic
top_terms <- terms(lda_model, 10)
# Print top terms for each topic
for (i in seq_len(ncol(top_terms))) {
cat("\nTopic", i, "Top Terms:\n")
print(top_terms[, i])
}
##
## Topic 1 Top Terms:
## [1] "food" "taste" "dog" "dogs" "product" "noodles" "cat"
## [8] "water" "flavor" "eat"
##
## Topic 2 Top Terms:
## [1] "coffee" "cup" "box" "cups" "product" "amazon" "pods"
## [8] "flavor" "keurig" "time"
##
## Topic 3 Top Terms:
## [1] "sugar" "oil" "chocolate" "product" "taste" "free"
## [7] "flavor" "mix" "fat" "coconut"
##
## Topic 4 Top Terms:
## [1] "milk" "water" "product" "organic" "oil" "honey" "coconut"
## [8] "time" "food" "taste"
##
## Topic 5 Top Terms:
## [1] "tea" "product" "amazon" "green" "box" "taste" "time"
## [8] "bag" "price" "buy"
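Top terms describe the topics but not which reviews belong to them. As a sketch of how documents could be assigned to topics, the example below fits a small model on the first 50 documents of the `AssociatedPress` matrix that ships with topicmodels (used here only so the example runs without ReviewsData.csv) and reads off the per-document topic probabilities.

```r
library(tm)
library(topicmodels)

# Example DocumentTermMatrix bundled with topicmodels; a small slice keeps it fast
data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:50, ]

toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Documents x topics matrix of probabilities; each row sums to 1
doc_topics <- posterior(toy_lda)$topics

# Most likely topic for each document
dominant <- topics(toy_lda)

head(round(doc_topics, 3))
table(dominant)
```

Applied to `lda_model`, the same two calls would show how many reviews fall mainly under each of the five topics, which can help check the interpretations given below.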
Interpret what each topic might represent (e.g., complaints about packaging, praise for taste, etc.).
Answer:
The first plot compares emotions in poor (1-2 star) and positive (4-5 star) reviews. The prominent emotions line up with the ratings: positive reviews show higher counts for “joy” and “anticipation,” while poor reviews show higher counts for “sadness,” “fear,” and “disgust.”
The second plot compares the top emotions in reviews categorized as
“Unhelpful (<25%)” and “Helpful (>75%)” based on the
HelpfulnessPercent variable.
The top six emotions (such as trust, joy, anticipation, sadness, fear,
and surprise) are largely the same in both groups. The types of emotion
do not differ strongly between helpful and unhelpful reviews, although
their counts vary.
Topic 1: Pet food (dog, cat) and general food items/taste.
Topic 2: Coffee and related products (cup, pods, Keurig),
packaging/ordering (box, Amazon).
Topic 3: Ingredients, health aspects (sugar, oil, chocolate, free, fat,
coconut), taste/flavor.
Topic 4: Natural/organic products, liquids (milk, water, honey, coconut
oil), general food terms.
Topic 5: Tea products, purchasing context (Amazon, price, buy, bag,
box).
Overall, common topics revolve around specific products (coffee, tea,
pet food), ingredients, taste/flavor, health attributes, and the
purchasing experience (packaging, price, Amazon).
Sentiment: It is somewhat surprising that “trust” appears as a top
emotion in both positive (4-5 stars) and poor (1-2 stars) reviews,
which suggests that it matters across rating levels. The similarity of
the top emotions in the helpfulness comparison plot is also unexpected
if one assumes that helpful and unhelpful reviews have very different
emotional drivers.
Topics: The topics generally cover expected areas for food reviews. The
clear distinction of coffee (Topic 2) and tea (Topic 5) into separate
topics highlights their prominence in the dataset. The inclusion of pet
food terms (Topic 1) alongside human food items is also noteworthy.