Preparing The Data
Describing The Complete Data Set
Word Usage Across All Reviews
Differences Between High and Low Star Ratings
Custom Stop Words
Comparing MGM Grand to The Bellagio
- Word Usage Patterns Across the Two Buffets

Preparing The Data

The first step in cleaning the data was identifying where each review begins so the raw text could be separated into individual reviews. Because the file follows a standard format, every review can be split into the same structure. Each review is then split into six fields: reviewID, userID, business, stars, text, and revDate. Next, we created a data frame with these fields as column headers and placed the values extracted from each review into the corresponding columns, so that each review is its own observation. We then cleaned the data frame by removing special characters, converting data types, and reformatting certain characters to make the data more consistent and easier to analyze. This step keeps characters such as dashes and quotation marks from adding noise to the data set. It also supports later analysis: an inconsistent date format, for example, would cause problems in any time-based analysis, and a quantitative variable stored as a character type could produce incorrect output.

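# packages used throughout this report
library(tidyverse)
library(lubridate)
library(tidytext)
library(textstem)
library(wordcloud)
library(RColorBrewer)

# rawReviews is assumed to already hold the full contents of the raw .txt file as a single string (the loading step is not shown here)
# When you inspect the raw .txt file, you should note a standard pattern where each new review begins with the same identifier (series of characters).  Once you determine the character sequence for this delimiter, complete the unlist(strsplit()) function to split the single string into a character vector of strings called "tempReviews" where each element represents one Yelp review (and all accompanying metadata about the review)
tempReviews <- unlist(strsplit(rawReviews, "ReviewID", fixed = TRUE))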

# Because of the formatting of the file, the first element in the tempReviews vector should be blank, remove it
# leave this code untouched
tempReviews <- tempReviews[-1]

# declare an empty data frame to store all the extracted variables from all reviews
# leave this code untouched
reviews <- data.frame(reviewID = character(), userID = character(), business = character(), stars = character(), text = character(), revDate = character())

# loop through the elements of the tempReviews vector to process each review, one by one
# Hint: the for loop structure has been given to you and does not need to change.  Within this for loop you only need to add code in the places clearly noted.  You should be able to find similar code/commands in the text pre-processing script where we organized data fields from a single news article.  This is essentially the same thing, you just need to tell R how to split the string into lines, then read data from each line into the "curDF" data frame.  The last line of code in the for loop will add the current review to the running list of all reviews.

for(i in 1:length(tempReviews)) {

  # split the current review (a single string) into a vector where each line is a new element
  revLines <- unlist(strsplit(tempReviews[i], "\r\n", fixed = TRUE))

  # Extract the reviewID, userID, business, stars, text (of the review), and reviewDate from the current review (each should be a separate element of the revLines vector you created in the last step, from elements 1 through 6, in order)
  curDF <- data.frame(reviewID = revLines[1], 
                      userID = revLines[2] , 
                      business = revLines[3] , 
                      stars = revLines[4], 
                      text = revLines[5], 
                      revDate = revLines[6])
  

  # append curDF to the master data frame (reviews) to save the current yelp review to the master list
  # no need to change this code
  reviews <- rbind(reviews, curDF)
  
}
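
Growing a data frame with rbind() inside a loop copies the whole frame on every pass, which can get slow for large files. The sketch below shows one possible alternative that parses every review first and binds the rows once. It assumes the same six-lines-per-review layout described above; the `toyRaw` string and all of its field values are made up for illustration and are not taken from the actual file.

```r
# Hypothetical raw text: two made-up reviews in the assumed layout
toyRaw <- paste0(
  "ReviewID:abc123\r\nuser01\r\nMGM Grand Buffet\r\n4\r\nGreat crab legs.\r\n1/5/2014 18:30\r\n",
  "ReviewID:def456\r\nuser02\r\nThe Buffet at Bellagio\r\n2\r\nLong wait.\r\n2/9/2015 12:05\r\n"
)

# Split into one string per review and drop the empty first element
toyReviews <- unlist(strsplit(toyRaw, "ReviewID", fixed = TRUE))[-1]

# Parse every review into a one-row data frame, then bind all rows in one step
fields <- c("reviewID", "userID", "business", "stars", "text", "revDate")
rows <- lapply(toyReviews, function(rev) {
  revLines <- unlist(strsplit(rev, "\r\n", fixed = TRUE))
  as.data.frame(as.list(setNames(revLines[1:6], fields)), stringsAsFactors = FALSE)
})
toyDF <- do.call(rbind, rows)

print(toyDF)
```

The result has the same six character columns as the loop version, so the type conversions that follow would apply unchanged.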

# You should now have a complete data frame called "reviews" that has 6 columns, all of which are character data types
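# convert everything in reviews$stars to be numeric
# note: the numeric ratings are stored in a new column, reviews$star; reviews$stars itself stays character
reviews$star <- as.numeric(reviews$stars)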

# convert reviews$revDate to be date-time format (hint: review the lubridate help file to determine how to extract month, day, year, hours, and minutes)
reviews$revDate <- mdy_hm(reviews$revDate)

# convert reviews$business to a factor
reviews$business <- as.factor(reviews$business)

# General data cleaning - we'll substitute and delete some characters in the reviews$text column:
# Hint: all of these can be done without the use of regular expressions, so use fixed=TRUE in your code
# replace all occurrences of a $ with the text "dollarSign" in the reviews$text column
# with fixed = TRUE the pattern is matched literally, so the $ must not be escaped

reviews$text <- gsub("$", "dollarSign", reviews$text, fixed = TRUE)

# delete all single quotes/apostrophes (') in the reviews$text column
reviews$text <- gsub("'", "", reviews$text, fixed = TRUE)


# delete all dashes (-) in the reviews$text column

reviews$text <- gsub("-", "", reviews$text, fixed = TRUE)


# short reviews may be less meaningful than longer, more in depth reviews.  Ultimately, we only want to keep those reviews that are at least 400 characters. 
# use an appropriate subsetting/filtering command to keep only those reviews where the text column has a number of characters >= 400

reviews <- reviews[nchar(reviews$text) >= 400, ]

Describing The Complete Data Set

A high-level overview of the Yelp review data shows 3140 reviews across both restaurants. The Buffet at Bellagio has more than three times as many reviews as MGM Grand Buffet: a little over 600 for MGM Grand Buffet versus over 2500 for The Buffet at Bellagio. Most reviews are 1000 characters or shorter, and the distribution of review length is right-skewed, with reviews concentrated near 500 characters and a long tail of longer reviews. Lastly, the star ratings cluster slightly around 3 and 4 stars rather than being spread evenly across the scale.

# How many reviews are in the data set after filtering to only keep those with at least 400 characters?

count(reviews)

##      n
## 1 3140

# Of this total number of reviews, how many are for "The MGM Grand Buffet" vs. "The Buffet at Bellagio" (the "business" variable) - either a data table or bar chart would be good here
reviews |> 
  ggplot(aes(x = business)) +
  geom_bar()

# Display a histogram showing the number of characters in each review (nchar(reviews$text))
# feel free to chose an appropriate number of bins to display the overall shape/skewness of the data

reviews |> 
  ggplot(aes(x = nchar(text))) +
  geom_histogram(bins = 20)

# Display a bar or column chart showing the number of reviews for each star rating

reviews |> 
  ggplot(aes(x = stars)) + 
  geom_bar()
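
Skew direction is easy to misread from a histogram, so a quick numeric check can help. The sketch below uses a common rule of thumb (comparing the mean to the median); the `toyLengths` values are invented for illustration and are not the actual review lengths.

```r
# Hypothetical review lengths (characters): most near 500, a few very long
toyLengths <- c(410, 450, 480, 500, 520, 560, 640, 900, 1500, 3200)

meanLen <- mean(toyLengths)
medianLen <- median(toyLengths)

# Rule of thumb: a mean above the median points to a long right tail (right skew),
# a mean below the median points to a long left tail (left skew)
skewDirection <- if (meanLen > medianLen) {
  "right"
} else if (meanLen < medianLen) {
  "left"
} else {
  "symmetric"
}

cat("mean:", meanLen, "median:", medianLen, "skew:", skewDirection, "\n")
```

On the real data the same comparison could be run on nchar(reviews$text), and on reviews$star for the ratings.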

Word Usage Across All Reviews

The data was further prepared by tokenizing the text into individual words, which lets us analyze each word rather than the entire text of a review. Next, we removed stop words, which are common in the English language but add nothing meaningful to our analysis. Lastly, we lemmatized the words to reduce each one to its base form while retaining its meaning. The most frequently appearing words suggest that reviews emphasize key menu items, pricing, service, and variety of offerings. Several specific menu items appear; these may point to popular dishes or, conversely, to disappointing ones. Themes of pricing/payment and service also appear, which may highlight important concerns for customers. Lastly, selection, whether expressed through particular menu items or the word itself, is a common theme, which makes sense for a buffet-style restaurant.

# tokenize (by word), remove stop words, lemmatize, and compute frequency statistics for the "text" column of the reviews data frame. Store these results in a data frame called "revWords"
revWords <- reviews |> 
  unnest_tokens(word, text) |> 
  anti_join(stop_words, by = "word") |> 
  mutate(word = lemmatize_words(word)) |> 
  count(reviewID, word, sort = TRUE)

# calculate document frequencies for words in the revWords data frame and store the result into docRevWords
docRevWords <- revWords |> 
  group_by(word) |> 
  summarise(df = n())

print(docRevWords)

## # A tibble: 10,926 × 2
##    word     df
##    <chr> <int>
##  1 0         8
##  2 0.05      1
##  3 0.88      1
##  4 00       28
##  5 00am      7
##  6 00p.m     1
##  7 00pm     20
##  8 01        1
##  9 04        2
## 10 05        4
## # ℹ 10,916 more rows

# merge number of documents (df) with revWords 
revWords <- left_join(revWords, docRevWords, by = "word")

# in performing the frequency analysis, we lost other columns (such as star rating), perform a left_join of revWords to reviews to bring this meta data back into the revWords data frame
revWords <- left_join(revWords, reviews, by = "reviewID")


# Display the 25 words that appear in the most Yelp reviews.  These results should be presented as a percentage of all reviews, not the basic df count.  
# HINT: Look for inspiration from the frequency analysis in-class script for how to display as percentages.

total_reviews <- n_distinct(reviews$reviewID)

top_words <- docRevWords |> 
  mutate(percentage = (df / total_reviews) * 100) |> 
  arrange(desc(percentage)) 
  

top_words |> 
  slice_max(df, n = 25) |>  # keep the rows with the 25 highest df values
  ggplot(aes(x = reorder(word, percentage), y = percentage)) +  # sort words by the percentage of reviews they appear in
  geom_col() + 
  geom_text(aes(label = round(percentage, 2)), hjust = 1, color = "#FFFFFF") + # add percentage labels to bar chart
  xlab(NULL) + 
  ylab("Percentage of Reviews") +
  coord_flip() + # flip from columnar format to bar format
  ggtitle("Words Appearing in the Most Yelp Reviews")
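
To make the percentage calculation concrete, here is a minimal sketch on made-up data. The `toyWords` table and its values are hypothetical; only the steps (count the reviews containing each word, divide by the number of reviews) mirror the code above.

```r
library(dplyr)

# Hypothetical per-review word counts (one row per review-word pair)
toyWords <- tibble(
  reviewID = c("r1", "r1", "r2", "r2", "r3", "r4"),
  word     = c("crab", "line", "crab", "price", "crab", "line"),
  n        = c(3, 1, 1, 2, 2, 1)
)

totalToyReviews <- n_distinct(toyWords$reviewID)

# Document frequency = number of reviews containing the word,
# expressed as a percentage of all reviews
toyDocFreq <- toyWords |>
  group_by(word) |>
  summarise(df = n()) |>
  mutate(percentage = df / totalToyReviews * 100) |>
  arrange(desc(percentage))

print(toyDocFreq)
```

Here "crab" appears in 3 of the 4 toy reviews, so its document frequency is 75 percent regardless of how many times it is repeated inside any one review.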

Differences Between High and Low Star Ratings

A word with a high tf-idf score in this context suggests that particular words are distinctive to 4 and 5 star reviews, while other words are distinctive to 1 and 2 star reviews. This means that customers who enjoyed the experience are probably talking about something different from those who did not. It is important for MGM Resorts to evaluate words with high tf-idf in high and low reviews because these words can point to common themes in what makes its restaurants enjoyable or not. Knowing what the most satisfied guests remember or talk about, and what complaints typically center on, is important for identifying problems, solutions, and core successes in the business. Interestingly, the word “word” appears very frequently and is unique to 1 and 2 star reviews, which suggests MGM may have a problem with infestations and sanitation that guests have experienced. “Refund” and “puke” also appear in low-rated reviews, which could signal customers who were dissatisfied with their meal and its taste to the point of demanding a refund or getting sick. On the other hand, the words “health” and “vegan” appear uniquely and commonly in highly rated reviews, which could suggest that offerings seen as healthy are viewed positively, that customers who rate these restaurants highly are trying the vegan options, and that the health-conscious menu items may be what they enjoy most. Also, “chorizo” is quite popular in the highly rated reviews, which could suggest that this menu item is a standout enjoyed by many.

# create a new data frame that filters revWords to only include reviews with a star rating >= 4
topReviews <- revWords |> 
  filter(star >= 4)

# Display a wordcloud of the top 75 words with the highest tf_idf values in the topReviews (4 & 5 star rating) df
topReviews <-
  topReviews |> 
  bind_tf_idf(word, reviewID, n)

wordcloud(topReviews$word, topReviews$tf_idf, max.words = 75, random.order = FALSE, scale = c(1.5, .9), colors = brewer.pal(8, "Set2"))

# create a new data frame that filters revWords to only include reviews with a star rating <= 2
bottomReviews <- revWords |> 
  filter(star <= 2)

glimpse(bottomReviews)

## Rows: 48,644
## Columns: 10
## $ reviewID <chr> ":SJedOOcNelvFGxn8d6GITg", ":ps7MpOtkVsDsstGxJWyZjg", ":COo_5…
## $ word     <chr> "line", "buffet", "drink", "buffet", "booth", "buffet", "buff…
## $ n        <int> 16, 16, 14, 14, 12, 12, 11, 11, 11, 11, 11, 10, 10, 10, 10, 1…
## $ df       <int> 1082, 2724, 607, 2724, 26, 2724, 2724, 2724, 333, 454, 2247, …
## $ userID   <chr> "dVG-_SawrmCIfVhEA_gBzQ", "OliPZrs4I8m8aZ3ONECzIQ", "upQW7sPf…
## $ business <fct> The Buffet at Bellagio, The Buffet at Bellagio, MGM Grand Buf…
## $ stars    <chr> "2", "2", "2", "1", "1", "2", "2", "2", "2", "1", "2", "2", "…
## $ text     <chr> "After reading a few unfavorable reviews, we still made the l…
## $ revDate  <dttm> 2014-12-27 04:51:00, 2013-08-19 19:47:00, 2014-07-11 11:29:0…
## $ star     <dbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2…

# Display a wordcloud of the top 75 words with the highest tf_idf values in the bottomReviews (1 & 2 star rating) df
bottomReviews <-
  bottomReviews |> 
  bind_tf_idf(word, reviewID, n)

wordcloud(bottomReviews$word, bottomReviews$tf_idf, max.words = 75, random.order = FALSE, scale = c(1.5, .9), colors = brewer.pal(8, "Set2"))
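
As a rough illustration of what a tf-idf score measures, the sketch below computes it by hand using the usual definition (term frequency times the natural log of the inverse document frequency), treating each review as a document. The `toyCounts` data is invented; the words are borrowed from the discussion above only to make the example readable.

```r
library(dplyr)

# Hypothetical word counts for three made-up reviews
toyCounts <- tibble(
  reviewID = c("r1", "r1", "r2", "r2", "r3", "r3"),
  word     = c("buffet", "chorizo", "buffet", "refund", "buffet", "line"),
  n        = c(4, 2, 3, 1, 5, 1)
)

nDocs <- n_distinct(toyCounts$reviewID)

toyTfIdf <- toyCounts |>
  group_by(reviewID) |>
  mutate(tf = n / sum(n)) |>          # share of the review's counted words
  group_by(word) |>
  mutate(idf = log(nDocs / n())) |>   # ln(reviews / reviews containing the word)
  ungroup() |>
  mutate(tf_idf = tf * idf)

print(toyTfIdf)
```

A word that appears in every review ("buffet" here) gets an idf of zero and therefore a tf-idf of zero, while a word concentrated in a single review ("chorizo") scores high. The score reflects how distinctive a word is among the reviews it is computed over, not how often it appears overall.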

Custom Stop Words

Words that add no real value to our analysis are common eating terms like “eating,” “dinner,” and “food,” and terms that surround a dining experience like “pay” and “review.” These words are to be expected in any food review because they provide context for the situation. We already understand the setting of these reviews and therefore don’t need extra terms explaining that reviewers are at “dinner” “eating” “food.” Words that repeat the restaurant name also serve no purpose; we already know which businesses we are analyzing. Standard stop words provide grammatical structure to the reviews but don’t reveal any meaning, which is why we removed them. Similarly, these custom words only establish a scenario or setting we are already familiar with, so they do not reveal the feelings, opinions, or pain points that could show what makes the dining experience unique to MGM or Bellagio and what makes it special or bad for reviewers.
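
One way such a list could be applied is sketched below. The `customStopWords` table is a hypothetical list assembled from the terms named in this section ("mgm" and "bellagio" are assumed tokens standing in for the restaurant names), and `toyWords` is made-up data; this is not the exact list used in the analysis.

```r
library(dplyr)

# Hypothetical custom stop word list based on the terms discussed above
customStopWords <- tibble(
  word = c("eating", "dinner", "food", "pay", "review", "mgm", "bellagio")
)

# Made-up word counts (one row per review-word pair)
toyWords <- tibble(
  reviewID = c("r1", "r1", "r1", "r2", "r2"),
  word     = c("food", "crab", "mgm", "dinner", "line"),
  n        = c(3, 2, 1, 1, 4)
)

# anti_join() keeps only the rows whose word is NOT in the custom list
toyFiltered <- toyWords |>
  anti_join(customStopWords, by = "word")

print(toyFiltered)
```

Because the earlier pipeline lemmatizes words, a real list would likely need the lemmatized forms of these terms as well.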

Comparing MGM Grand to The Bellagio

It would be beneficial to recalculate frequency statistics using each buffet as a separate corpus because this makes our analysis more specific and focused. The separation lets us see what makes the MGM a success for customers and what specific problems it has, versus what customers particularly enjoyed at the Bellagio and what they found worrisome there. This allows for more tailored solutions and improvements for each restaurant, because a common problem could reside in one restaurant and not both, and what is enjoyed at one restaurant may not carry over to the other. We previously aggregated and analyzed the entire review set; now we can dive deeper and see restaurant-specific themes and the unique strengths and weaknesses of each. The language could be very different across these restaurants.
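
The per-buffet data frames used in the next section (mgmWords and bellagioWords) are not constructed in the code shown in this report. The sketch below is one hypothetical way to build such corpora: filter the word counts by business, then compute tf-idf within each subset. The `toy*` objects and their values are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up word counts for two reviews per buffet
toyRevWords <- tibble(
  reviewID = c("r1", "r1", "r2", "r2", "r3", "r3", "r4", "r4"),
  business = c(rep("MGM Grand Buffet", 4), rep("The Buffet at Bellagio", 4)),
  word     = c("crab", "line", "crab", "price", "crab", "dessert", "dessert", "price"),
  n        = c(2, 1, 1, 3, 4, 2, 1, 2)
)

# Each buffet becomes its own corpus, so idf is computed only from that buffet's reviews
toyMgmWords <- toyRevWords |>
  filter(business == "MGM Grand Buffet") |>
  bind_tf_idf(word, reviewID, n)

toyBellagioWords <- toyRevWords |>
  filter(business == "The Buffet at Bellagio") |>
  bind_tf_idf(word, reviewID, n)

print(toyMgmWords)
print(toyBellagioWords)
```

In this toy data "crab" scores zero in the MGM corpus because every MGM review mentions it, yet it is distinctive in the Bellagio corpus, where only one review does. That difference is invisible when both buffets are pooled into one corpus.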

Word Usage Patterns Across the Two Buffets

The first graph comparing the two restaurants shows a much higher proportion of positive words than negative words in the reviews. The ratio of positive to negative is also higher for MGM, suggesting that it is viewed more favorably than its counterpart and that its experience is probably enjoyed more due to a particular strength. Therefore, management should prioritize improvements at the Bellagio, or identify which strengths of MGM could be applied at the Bellagio.

# Enter any code you deem necessary to examine the similarities and differences across the two buffets.  You may experiment with different ways to slice the data and/or create visualizations.  Do not include any of your exploratory code, only keep the code that directly generates the final output you wish to share with MGM Resorts.

#mgm

sentiments <- get_sentiments("bing")

sentiment_analysis <- mgmWords %>%
  inner_join(sentiments, by = "word") %>%
  group_by(business, sentiment) %>%
  summarize(count = sum(n), .groups = "drop")

ggplot(sentiment_analysis, aes(x = sentiment, y = count, fill = business)) +
  geom_col(position = "dodge") +
  labs(title = "Positive vs. Negative Words in Reviews",
       x = "Sentiment", y = "Word Count") +
  theme_minimal()

#bellagio

sentiments <- get_sentiments("bing")

sentiment_analysis <- bellagioWords %>%
  inner_join(sentiments, by = "word") %>%
  group_by(business, sentiment) %>%
  summarize(count = sum(n), .groups = "drop")

ggplot(sentiment_analysis, aes(x = sentiment, y = count, fill = business)) +
  geom_col(position = "dodge") +
  labs(title = "Positive vs. Negative Words in Reviews",
       x = "Sentiment", y = "Word Count") +
  theme_minimal()
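
Because The Buffet at Bellagio has several times as many reviews as MGM Grand Buffet, raw word counts are not directly comparable across the two charts. A sketch of how the positive share could be computed for each buffet is shown below. The `toyLexicon` table is a made-up stand-in for the Bing lexicon and `toyWords` is invented data, so the numbers say nothing about the actual reviews.

```r
library(dplyr)

# Hypothetical mini sentiment lexicon (stand-in for get_sentiments("bing"))
toyLexicon <- tibble(
  word      = c("fresh", "delicious", "cold", "rude"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up word counts per buffet
toyWords <- tibble(
  business = c(rep("MGM Grand Buffet", 3), rep("The Buffet at Bellagio", 3)),
  word     = c("fresh", "rude", "crab", "delicious", "cold", "rude"),
  n        = c(6, 2, 5, 3, 2, 1)
)

# Keep only words found in the lexicon, then compute the positive share per buffet
toyShare <- toyWords |>
  inner_join(toyLexicon, by = "word") |>
  group_by(business) |>
  summarise(
    positive = sum(n[sentiment == "positive"]),
    negative = sum(n[sentiment == "negative"]),
    .groups = "drop"
  ) |>
  mutate(positiveShare = positive / (positive + negative))

print(toyShare)
```

Comparing positiveShare rather than the bar heights keeps the difference in review volume from driving the conclusion.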

Furthermore, the second graph shows the most common bigrams for each restaurant. Bigrams can reveal more than single words by showing whether a particular descriptor is attached to a dish or whether a pair of items is commonly mentioned together. This gives more context to the words and shows greater detail about reviewers’ feelings toward the experience. For example, a common bigram across both restaurants is “cheap buffet”; however, “low quality,” “wait time,” and “bad attitude” are most common for MGM. This potentially shows where MGM’s weaknesses are and where management could improve its ingredients, staffing, and operations. A popular bigram unique to The Bellagio is “low price,” which could signal that price is an important concern for customers who enjoy a cheaper buffet or see greater value in this restaurant’s buffet offering. “Taste fresh” is found in both, which could signal to management that customers care about the quality of their dishes and that both restaurants should continue prioritizing the freshness of the food.

#mgm

# mgmWords is assumed to hold one row per review-word pair (with the review text repeated),
# so keep each review's text once before tokenizing into bigrams
mgmBigrams <- mgmWords %>%
  distinct(business, text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(business, bigram, sort = TRUE)

# drop bigrams in which either word is a stop word
mgmBigrams <- mgmBigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

mgmBigrams$bigram <- lemmatize_strings(mgmBigrams$bigram)

# lemmatizing can map several bigrams to the same string, so re-aggregate to one row per bigram
# (tf-idf is skipped here: with a single business as the only document, every idf would be zero)
mgmBigrams <- mgmBigrams %>%
  count(business, bigram, wt = n, name = "n", sort = TRUE)

wordcloud(mgmBigrams$bigram, mgmBigrams$n, max.words = 50, random.order = FALSE, scale = c(2, .6), colors = brewer.pal(8, "Set2"))

#bellagio

# bellagioWords is assumed to hold one row per review-word pair (with the review text repeated),
# so keep each review's text once before tokenizing into bigrams
bellagioBigrams <- bellagioWords %>%
  distinct(business, text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(business, bigram, sort = TRUE)

# drop bigrams in which either word is a stop word
bellagioBigrams <- bellagioBigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

bellagioBigrams$bigram <- lemmatize_strings(bellagioBigrams$bigram)

# lemmatizing can map several bigrams to the same string, so re-aggregate to one row per bigram
# (tf-idf is skipped here: with a single business as the only document, every idf would be zero)
bellagioBigrams <- bellagioBigrams %>%
  count(business, bigram, wt = n, name = "n", sort = TRUE)

wordcloud(bellagioBigrams$bigram, bellagioBigrams$n, max.words = 50, random.order = FALSE, scale = c(2, .6), colors = brewer.pal(8, "Set2"))

## Warning in wordcloud(bellagioBigramsDF$bigram, bellagioBigramsDF$df, max.words
## = 50, : short line could not be fit on page. It will not be plotted.

## Warning in wordcloud(bellagioBigramsDF$bigram, bellagioBigramsDF$df, max.words
## = 50, : tasty meat could not be fit on page. It will not be plotted.

## Warning in wordcloud(bellagioBigramsDF$bigram, bellagioBigramsDF$df, max.words
## = 50, : time wait could not be fit on page. It will not be plotted.

## Warning in wordcloud(bellagioBigramsDF$bigram, bellagioBigramsDF$df, max.words
## = 50, : wait time could not be fit on page. It will not be plotted.

## Warning in wordcloud(bellagioBigramsDF$bigram, bellagioBigramsDF$df, max.words
## = 50, : waste food could not be fit on page. It will not be plotted.

Lastly, the final graph shows the most common words found in 1 star and 5 star reviews for each restaurant. First, a very popular word in 5 star reviews across both restaurants is “dessert,” which could suggest customers greatly enjoy this offering. Dessert could be the restaurants’ specialty and standout item, and managers should maintain this consistency. Interestingly, “price” is among the top 10 words in 5 star reviews for The Bellagio, and our previous graph showed that “low price” was a common bigram. This supports our analysis that the Bellagio is likely a better-value experience, with customers attributing its 5 star success to the lower price and considering it a fair deal. On the other hand, “price” is a top 10 word in MGM’s 1 star reviews. This signals the opposite for MGM: customers likely think it is too expensive and not a fair value. “Money” is also in the top 10, reinforcing the theme that customers likely do not find this dining experience worth the cost. Therefore, management should look for opportunities to bring more value to customers, whether through the food or the dining experience, so they feel their money was well spent. This type of strategy is likely to lead to growth or additional spending.

The menu item “crab” is popular at both restaurants, but it appears in both 1 star and 5 star reviews for the Bellagio and only in 5 star reviews for the MGM. This could suggest that MGM makes a better dish, has a fresher seafood offering, or offers better value. Because this dish is popular among customers, management should examine why it stands out in MGM’s 5 star reviews, what MGM does to make it successful, and how to apply that at the Bellagio to prevent poorly rated reviews for the crab. Lastly, we can draw out broader themes that customers care about. For example, words like “wait” and “time” are consistent across 1 and 5 star reviews and across both restaurants. This suggests that customer service is a critical component of how customers assess their dining experience and should be a priority for management, beyond food and pricing alone.

word_star_rating <- mgmWords %>%
  group_by(stars, word) %>%
  summarize(count = sum(n), .groups = "drop") %>%
  filter(stars %in% c(1, 5))

# Get top words for 1-star and 5-star reviews
top_words_by_star <- word_star_rating %>%
  group_by(stars) %>%
  slice_max(count, n = 10)

# Plot bar chart
ggplot(top_words_by_star, aes(x = reorder(word, count), y = count, fill = as.factor(stars))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~stars, scales = "free") +
  labs(title = "Top Words in 1-Star vs. 5-Star Reviews MGM",
       x = "Word", y = "Count")

word_star_rating <- bellagioWords %>%
  group_by(stars, word) %>%
  summarize(count = sum(n), .groups = "drop") %>%
  filter(stars %in% c(1, 5))

# Get top words for 1-star and 5-star reviews
top_words_by_star <- word_star_rating %>%
  group_by(stars) %>%
  slice_max(count, n = 10)

# Plot bar chart
ggplot(top_words_by_star, aes(x = reorder(word, count), y = count, fill = as.factor(stars))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~stars, scales = "free") +
  labs(title = "Top Words in 1-Star vs. 5-Star Reviews Bellagio",
       x = "Word", y = "Count")

HW2 - Pre-processing and Frequency Analysis

William Truong (Addie Burrow)

February 12, 2025