Breakfast Reviews

Author

Mick Rathbone

Introduction

The data come from Yelp reviews of a Denny’s and a Bob Evans in Cincinnati. The goal is to compare the two locations through the reviewers’ sentiments. I examine three questions toward that goal and check whether the sentiment results match the overall ratings the reviewers gave.

# Loading packages
library(tidytext)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

breakfast_reviews <- 
  read_csv("BreakfastReviews.csv")

Rows: 80 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): reviewer_location, review_content, restaurant
dbl  (1): review_rating
date (1): review_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What are the most common words in reviews for both restaurants?

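# Bing lexicon: labels each word as positive or negative
bing <- 
  get_sentiments("bing")

# Tokenize the restaurant names so they can be dropped from the review text
tidy_word <-
  breakfast_reviews %>% 
  unnest_tokens(word,restaurant)

# One row per review word, minus stop words and restaurant-name tokens.
# The second anti_join matches on every shared column (see the message
# below), not on word alone.
tidy_breakfast <- 
  breakfast_reviews %>%
  unnest_tokens(word,review_content) %>%
  anti_join(stop_words) %>% 
  anti_join(tidy_word)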

Joining with `by = join_by(word)`
Joining with `by = join_by(reviewer_location, review_date, review_rating,
word)`

breakfast_counts <- 
  tidy_breakfast %>% 
  group_by(restaurant, word) %>% 
  summarize(n = n()) %>% 
  inner_join(bing)

`summarise()` has grouped output by 'restaurant'. You can override using the
`.groups` argument.
Joining with `by = join_by(word)`

breakfast_counts %>% 
  group_by(restaurant) %>%
  filter(n>5) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~restaurant, ncol = 2) +
  geom_text(aes(label = signif(n, digits = 3)), nudge_y = 8) +
  labs(title = "Positive and Negative Words for Restaurants",
       subtitle = "Only words appearing more than 5 times are shown",
       x = "Words",
       y = "Number of Times Word Appears")

A first look at the most common positive and negative words gives some initial insight into overall sentiment. Both restaurants had three positive words that passed the frequency filter, but the counts can mislead without context. “Pretty” is hard to classify, since food can be aesthetically pretty or merely “pretty good” or “pretty bad”. “Grand” is also misleading because it begins the name of a Denny’s menu item. That leaves “ready”, “delicious”, and “sweet” as the most informative positive words, so the following visual, which drops “pretty” and “grand”, is more representative of the reviews.

breakfast_counts %>% 
  group_by(restaurant) %>%
  filter(!word %in% c("grand", "pretty")) %>% 
  filter(n>5) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~restaurant, ncol = 2) +
  geom_text(aes(label = signif(n, digits = 3)), nudge_y = 8) +
  labs(title = "Positive and Negative Words for Restaurants",
       subtitle = "Only words appearing more than 5 times are shown",
       x = "Words",
       y = "Number of Times Word Appears")
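
One way to check how ambiguous words such as “pretty” and “grand” are actually used is to look at the word that follows them. The sketch below does this with bigrams on a few made-up reviews; `toy_reviews` and its contents are hypothetical and are not taken from BreakfastReviews.csv.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy reviews, NOT rows from BreakfastReviews.csv
toy_reviews <- tibble(
  restaurant = c("Denny's", "Denny's", "Bob Evans"),
  review_content = c(
    "The grand slam was pretty good",
    "Service was pretty bad and slow",
    "Pancakes were pretty good and sweet"
  )
)

# Split each review into overlapping two-word sequences, then keep only
# the bigrams that start with one of the ambiguous words
context_bigrams <- toy_reviews %>%
  unnest_tokens(bigram, review_content, token = "ngrams", n = 2) %>%
  filter(grepl("^(pretty|grand) ", bigram)) %>%
  count(bigram, sort = TRUE)

context_bigrams
```

If most occurrences turn out to be phrases like “pretty bad” or a menu item name, that would support dropping the word before scoring sentiment.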

Which restaurant had a better overall positivity score and does this match the ratings for each restaurant?

tidy_breakfast %>%
  inner_join(bing) %>% 
  # No filter here: "grand" and "pretty" are deliberately kept in this version
  group_by(restaurant, sentiment) %>% 
  summarize(n = n()) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  ggplot(aes(x = restaurant, y = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Restaurant Positivity Scores",
       subtitle = "Positivity score is the total number of positive words minus total negative words",
       y = "Total Positivity Score",
       x = "Restaurant")

Joining with `by = join_by(word)`
`summarise()` has grouped output by 'restaurant'. You can override using the
`.groups` argument.

tidy_breakfast %>%
  inner_join(bing) %>% 
  filter(!word %in% c("grand", "pretty")) %>%
  group_by(restaurant, sentiment) %>% 
  summarize(n = n()) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  ggplot(aes(x = restaurant, y = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Restaurant Positivity Scores",
       subtitle = "Positivity score is the total number of positive words minus total negative words",
       y = "Total Positivity Score",
       x = "Restaurant")

Joining with `by = join_by(word)`
`summarise()` has grouped output by 'restaurant'. You can override using the
`.groups` argument.

breakfast_reviews %>% 
  group_by(restaurant) %>% 
  summarize(AverageRating = mean(review_rating)) %>% 
  arrange(desc(AverageRating))

# A tibble: 2 × 2
  restaurant AverageRating
  <chr>              <dbl>
1 Bob Evans            3.1
2 Denny's              2.7

Computing an overall positivity score for each restaurant suggests which one reviewers describe more favorably. I made two versions of the visualization, one that filters out the words “pretty” and “grand” and one that does not. Viewed together, they show why the filter matters: without it, Denny’s has a high positivity score. With the filter, Denny’s has a score of 0, meaning positive and negative words balanced out, so the reviews were neither positive nor negative overall but a mix of both. Bob Evans, on the other hand, has a negative score with and without the filter and is considerably lower than Denny’s.
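
To make the effect of the filter concrete, here is a minimal sketch of the positivity score (positive words minus negative words) on made-up data. `toy_words`, `toy_lexicon`, and the helper `positivity_score()` are hypothetical names introduced only for illustration; the lexicon is a tiny stand-in for bing, and the numbers are not the ones behind the plots.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tokenized reviews: one row per word occurrence
toy_words <- tribble(
  ~restaurant, ~word,
  "Denny's",   "grand",
  "Denny's",   "grand",
  "Denny's",   "pretty",
  "Denny's",   "delicious",
  "Denny's",   "cold",
  "Bob Evans", "sweet",
  "Bob Evans", "slow",
  "Bob Evans", "cold"
)

# Hypothetical stand-in for the bing lexicon
toy_lexicon <- tribble(
  ~word,       ~sentiment,
  "grand",     "positive",
  "pretty",    "positive",
  "delicious", "positive",
  "sweet",     "positive",
  "cold",      "negative",
  "slow",      "negative"
)

# Score = number of positive words minus number of negative words,
# optionally dropping some words before scoring
positivity_score <- function(words, drop_words = character()) {
  words %>%
    filter(!word %in% drop_words) %>%
    inner_join(toy_lexicon, by = "word") %>%
    count(restaurant, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(score = positive - negative)
}

unfiltered <- positivity_score(toy_words)
filtered   <- positivity_score(toy_words, drop_words = c("grand", "pretty"))
```

In this toy data, the hypothetical Denny’s score falls from 3 to 0 once the two words are dropped, while the Bob Evans score stays at -1. Because the score is a raw difference, it also depends on how many words each restaurant’s reviews contain.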

Although Bob Evans had the much lower sentiment score, its average rating (3.1) was about 15% higher than Denny’s (2.7). This is a surprising result, and it shows that sentiment analysis with a lexicon such as bing does not always tell the full story.
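
The 15% figure can be reproduced from the two averages as printed in the table above. This is only the arithmetic; the variable names are chosen for illustration, and the exact value could differ slightly if the printed averages are rounded.

```r
# Average ratings as printed in the summary table
bob_evans_avg <- 3.1
dennys_avg    <- 2.7

# Relative difference, with Denny's as the baseline
pct_higher <- (bob_evans_avg - dennys_avg) / dennys_avg * 100
round(pct_higher, 1)
```

The result is roughly 14.8%, which rounds to the 15% quoted in the text.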

What were the most common words across both restaurants’ reviews and does this match the overall restaurant ratings?

breakfast_counts1 <- 
  tidy_breakfast %>% 
  group_by(word) %>% 
  summarize(n = n()) %>% 
  inner_join(bing)

Joining with `by = join_by(word)`

breakfast_counts1 %>%
  filter(!word %in% c("grand", "pretty")) %>%
  filter(n > 5) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Restaurant Sentiment Scores by Word",
       subtitle = "Scorable words appearing more than 5 times",
       x = "Words",
       y = "Number of Times Word Appears")

breakfast_reviews %>% 
  summarize(AverageRating = mean(review_rating))

# A tibble: 1 × 1
  AverageRating
          <dbl>
1           2.9

To answer this question, I did not group the words by restaurant and instead pooled all the reviews. Four positive words passed the frequency filter, compared with seven negative words. This suggests that the restaurants’ ratings should not be very high, and the overall average of 2.9 in the table is consistent with that.
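
The comparison above counts distinct frequent words per sentiment, which is not the same as counting total occurrences. A minimal sketch of both tallies, using a hypothetical `toy_counts` table shaped like `breakfast_counts1` (the words and counts are made up):

```r
library(dplyr)
library(tibble)

# Hypothetical word counts with sentiment labels, NOT the real results
toy_counts <- tribble(
  ~word,       ~n, ~sentiment,
  "delicious",  9, "positive",
  "sweet",      6, "positive",
  "fresh",      3, "positive",
  "cold",      12, "negative",
  "slow",       8, "negative",
  "rude",       7, "negative",
  "bad",        5, "negative"
)

# Keep words appearing more than 5 times, then tally per sentiment:
# how many distinct words remain and how often they occur in total
frequent_by_sentiment <- toy_counts %>%
  filter(n > 5) %>%
  group_by(sentiment) %>%
  summarize(distinct_words = n(), total_occurrences = sum(n))

frequent_by_sentiment
```

Reporting total occurrences alongside the number of distinct words would show whether a few very frequent words dominate one side.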