Preparing The Data
Describing The Complete Data Set
Word Usage Across All Reviews
Differences Between High and Low Star Ratings
Custom Stop Words
Comparing MGM Grand to The Bellagio
- Word Usage Patterns Across the Two Buffets

Preparing The Data

The raw data followed a standard format, allowing it to be consistently parsed and structured into a data frame. Each review entry contained six key components: review ID, user ID, business name, star rating, review text, and review date. These elements were extracted and organized into a structured format, with each review occupying a separate row in the data frame.

To refine the data, several cleaning steps were performed. First, the extracted features were converted to appropriate data types: star ratings to numeric values, review dates to date-time format, and business names to a factor. Next, selected punctuation and special characters were removed or replaced, such as converting “$” to “dollarSign” and eliminating apostrophes and hyphens to standardize text processing. Finally, to ensure the analysis focused on meaningful reviews, only those at least 400 characters in length were retained, as longer reviews tend to provide more insightful feedback.

These cleaning steps transformed the raw text into a structured, usable format for analysis.

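# packages used by the code in this report
# (rawReviews, the raw file contents as a single string, is assumed to have been read in already)
library(dplyr)      # filter, count, joins, summarise
library(lubridate)  # mdy_hm
library(tidytext)   # unnest_tokens, stop_words, bind_tf_idf
library(textstem)   # lemmatize_words
library(ggplot2)    # plots
library(forcats)    # fct_reorder
library(wordcloud)  # wordcloud

# When you inspect the raw .txt file, you should note a standard pattern where each new review begins with the same identifier (series of characters).  Once you determine the character sequence for this delimiter, complete the unlist(strsplit()) function to split the single string into a character vector of strings called "tempReviews" where each element represents one Yelp review (and all accompanying metadata about the review)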
tempReviews <- unlist(strsplit(rawReviews, "ReviewID:", fixed= TRUE))

# Because of the formatting of the file, the first element in the tempReviews vector should be blank, remove it
# leave this code untouched
tempReviews <- tempReviews[-1]

# declare an empty data frame to store all the extracted variables from all reviews
# leave this code untouched
reviews <- data.frame(reviewID = character(), userID = character(), business = character(), stars = character(), text = character(), revDate = character())

# loop through the elements of the tempReviews vector to process each review, one by one
# Hint: the for loop structure has been given to you and does not need to change.  Within this for loop you only need to add code in the places clearly noted.  You should be able to find similar code/commands in the text pre-processing script where we organized data fields from a single news article.  This is essentially the same thing, you just need to tell R how to split the string into lines, then read data from each line into the "curDF" data frame.  The last line of code in the for loop will add the current review to the running list of all reviews.

for(i in seq_along(tempReviews)) {

  # split the current review (a single string) into a vector where each line is a new element
  revLines <- unlist(strsplit(tempReviews[i], "\r\n", fixed = TRUE))
  
  # Extract the reviewID, userID, business, stars, text (of the review), and reviewDate from the current review (each should be a separate element of the revLines vector you created in the last step, from elements 1 through 6, in order)
  curDF <- data.frame(reviewID = revLines[1], 
                      userID = revLines[2], 
                      business = revLines[3], 
                      stars = revLines[4], 
                      text = revLines[5], 
                      revDate = revLines[6])
  
  
  
  # append curDF to the master data frame (reviews) to save the current yelp review to the master list
  # no need to change this code
  reviews <- rbind(reviews, curDF)
  
}

# You should now have a complete data frame called "reviews" that has 6 columns, all of which are character data types
# convert everything in reviews$stars to be numeric

reviews$stars <- as.numeric(reviews$stars)

# convert reviews$revDate to be date-time format (hint: review the lubridate help file to determine how to extract month, day, year, hours, and minutes)

reviews$revDate <- mdy_hm(reviews$revDate)

# convert reviews$business to a factor

reviews$business <- as.factor(reviews$business)

# General data cleaning - we'll substitute and delete some characters in the reviews$text column:
# Hint: all of these can be done without the use of regular expressions, so use fixed=TRUE in your code
# replace all occurrences of a $ with the text "dollarSign" in the reviews$text column

reviews$text <- gsub("$", "dollarSign", reviews$text, fixed = TRUE)

# delete all single quotes/apostrophes (') in the reviews$text column

reviews$text <- gsub("'", "", reviews$text, fixed = TRUE)

# delete all dashes (-) in the reviews$text column

reviews$text <- gsub("-", "", reviews$text, fixed = TRUE)


# short reviews may be less meaningful than longer, more in depth reviews.  Ultimately, we only want to keep those reviews that are at least 400 characters. 
# use an appropriate subsetting/filtering command to keep only those reviews where the text column has a number of characters >= 400

reviews <- reviews |> 
  filter(nchar(text) >= 400)
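
The parsing and cleaning steps above can be tried on a small in-memory sample. The sketch below is illustrative only: `rawSample`, its two made-up reviews, and the assumption that each field sits bare on its own line are not taken from the actual data file. It also builds the rows with `lapply()` and a single `rbind()` rather than growing the data frame inside a loop.

```r
# Hypothetical two-review sample in the assumed layout (all values are made up)
rawSample <- paste0(
  "ReviewID:r1\r\nu1\r\nMGM Grand Buffet\r\n4\r\nGreat crab legs - worth $30\r\n1/5/2015 18:30\r\n",
  "ReviewID:r2\r\nu2\r\nThe Buffet at Bellagio\r\n2\r\nDidn't like the line\r\n2/7/2016 12:05\r\n"
)

# split on the delimiter; the first element is empty, so drop it
chunks <- unlist(strsplit(rawSample, "ReviewID:", fixed = TRUE))[-1]

# turn each chunk into a one-row data frame, then bind all rows at once
rows <- lapply(chunks, function(chunk) {
  lines <- unlist(strsplit(chunk, "\r\n", fixed = TRUE))
  data.frame(reviewID = lines[1], userID = lines[2], business = lines[3],
             stars = as.numeric(lines[4]), text = lines[5], revDate = lines[6])
})
sampleReviews <- do.call(rbind, rows)

# same fixed-string substitutions as in the report
sampleReviews$text <- gsub("$", "dollarSign", sampleReviews$text, fixed = TRUE)
sampleReviews$text <- gsub("'", "", sampleReviews$text, fixed = TRUE)
sampleReviews$text <- gsub("-", "", sampleReviews$text, fixed = TRUE)

sampleReviews$text
```

Because `fixed = TRUE` is used, `$` is treated as a literal character rather than as the regular-expression end-of-line anchor.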

Describing The Complete Data Set

The dataset consists of 3,154 reviews, each containing a user ID, the business being reviewed (either The Buffet at Bellagio or MGM Grand Buffet), a star rating, and the review text. Approximately 25% of the reviews pertain to the MGM Grand Buffet, while the remaining 75% are for The Buffet at Bellagio. On average, reviews contain 921 characters, and the overall average star rating across all reviews is 3.2.

##      n
## 1 3154

##   average_review_length average_rating
## 1               921.098        3.21655
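
The code that produced these summaries is not shown in the report. A minimal sketch of how such figures could be computed with dplyr follows; `toyReviews` and its values are hypothetical stand-ins for the `reviews` data frame.

```r
library(dplyr)

# Hypothetical toy stand-in for the `reviews` data frame (values are made up)
toyReviews <- data.frame(
  business = c("MGM Grand Buffet", "The Buffet at Bellagio",
               "The Buffet at Bellagio", "The Buffet at Bellagio"),
  stars    = c(2, 4, 3, 5),
  text     = c("ok", "great", "fine", "superb!")
)

# overall count, mean review length in characters, and mean star rating
overall <- toyReviews |>
  summarise(n = n(),
            average_review_length = mean(nchar(text)),
            average_rating = mean(stars))

# share of reviews belonging to each business
shares <- toyReviews |>
  count(business) |>
  mutate(share = n / sum(n))

overall
shares
```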

Word Usage Across All Reviews

To prepare the review text for analysis, we tokenized each word into its own row, removed common stop words (e.g., “the,” “and,” “to”), lemmatized words to their root forms, and conducted a frequency analysis.

The most frequently occurring words included expected terms like “buffet,” “food,” and “dessert.” However, other high-frequency words such as “price,” “line,” “service,” “quality,” “wait,” and “selection” highlight key factors that influence customer experiences and drive review content. This suggests that pricing, wait times, and service quality play a significant role in shaping customer perceptions and reviews. Addressing these areas could lead to higher satisfaction and improved ratings.

# tokenize (by word), remove stop words, lemmatize, and compute frequency statistics for the "text" column of the reviews data frame. Store these results in a data frame called "revWords"
revWords <- reviews %>%
  unnest_tokens(word, text) %>% # tokenize by word
  anti_join(stop_words) %>% # remove common stop words
  mutate(word = lemmatize_words(word))%>% # lemmatize words
  count(reviewID, word, sort=TRUE) %>% # count the number of times a word appears in each document
  bind_tf_idf(word, reviewID, n) # calculate tf, idf, and tf-idf values

head(revWords, n = 10)

##                  reviewID   word  n         tf       idf     tf_idf
## 1  kc9bE_ssRgX-_ajFOlzGZg buffet 27 0.15882353 0.1432716 0.02275490
## 2  AMiR49hC8uoDaJPVr1N7RQ buffet 22 0.11702128 0.1432716 0.01676582
## 3  FZonsUBVS920yU60K4EVaQ      5 22 0.09016393 2.3969446 0.21611795
## 4  v0OQgzDh6tH6XSVCe2uk6g buffet 20 0.07272727 0.1432716 0.01041975
## 5  6DPyBIzyPjLZ8SriM9gM_w   star 19 0.13669065 1.8640643 0.25480015
## 6  89CsjoK5R0IPdm2knJsDrg   wynn 16 0.08465608 2.1731044 0.18396651
## 7  SJedOOcNelvFGxn8d6GITg   line 16 0.09248555 1.0661703 0.09860534
## 8  Ye2zV2N5eFVk92UP7KuwYA     00 16 0.10958904 4.7242223 0.51772299
## 9  ps7MpOtkVsDsstGxJWyZjg buffet 16 0.05194805 0.1432716 0.00744268
## 10 MBfBdNi8bKMwRQ2shvx6bQ buffet 15 0.14563107 0.1432716 0.02086479
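
The first row of this output can be checked by hand. The sketch below assumes, as tidytext's `bind_tf_idf` does, that tf is a word's count divided by the total counted words in the review and that idf is the natural log of the number of documents divided by the number of documents containing the word. The total of 170 words is inferred from the printed tf, not read from the data.

```r
n_docs <- 3154          # number of reviews, from the summary above
n      <- 27            # count of "buffet" in review kc9bE_ssRgX-_ajFOlzGZg
tf     <- 0.15882353    # printed tf for that row
idf    <- 0.1432716     # printed idf for "buffet"

# total counted words in the review, inferred from n / tf
words_in_review <- round(n / tf)

# tf-idf recomputed from its parts
tf_idf <- (n / words_in_review) * idf

# invert idf = log(n_docs / df) to recover the document frequency
docs_with_buffet  <- round(n_docs * exp(-idf))
share_with_buffet <- docs_with_buffet / n_docs

c(words_in_review = words_in_review, tf_idf = tf_idf,
  docs_with_buffet = docs_with_buffet, share_with_buffet = share_with_buffet)
```

Under these assumptions, the low idf for “buffet” corresponds to the word appearing in roughly 87% of reviews, which is why its tf-idf stays small even in reviews that use it many times.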

# calculate document frequencies for words in the revWords data frame and store the result into docRevWords
docRevWords <- revWords %>%
  group_by(word) %>% 
  summarise(df = n())
  
# merge number of documents (df) with revWords 
revWords <- left_join(revWords, docRevWords, by = "word")

# in performing the frequency analysis, we lost other columns (such as star rating), perform a left_join of revWords to reviews to bring this meta data back into the revWords data frame
revWords <- left_join(revWords, reviews, by = "reviewID")


# Display the 25 words that appear in the most Yelp reviews.  These results should be presented as a percentage of all reviews, not the basic df count.  
# HINT: Look for inspiration from the frequency analysis in-class script for how to display as percentages.

docRevWords |> 
  mutate(most_seen = 100 * df / nrow(reviews)) |> # divide by the number of reviews, not the number of unique words
  arrange(desc(most_seen)) |> 
  head(25) |> 
  ggplot(aes(y = fct_reorder(word, most_seen), x = most_seen)) +
  geom_col() +
  labs(
    x = "Percentage of reviews containing the word",
    y = "Word"
  )

Differences Between High and Low Star Ratings

A high tf-idf score indicates that a word is particularly important in a subset of reviews, meaning it captures key aspects of specific customer experiences that influenced their ratings.

For MGM Resorts, analyzing high tf-idf words in highly rated reviews can reveal what made certain experiences exceptional, providing insights into factors that drive positive customer sentiment. Conversely, examining high tf-idf words in low-rated reviews can highlight recurring issues that negatively impact guest experiences. By emphasizing and replicating positive aspects while addressing negative ones, MGM can work to improve overall satisfaction and ratings.

Notably, words like fiancée, decoration, and invitation appear frequently in highly rated reviews, suggesting that guests celebrating special occasions at MGM properties tend to have positive experiences. On the other hand, words like vomit, puke, and quality in lower-rated reviews indicate concerns about food quality and potential health-related incidents, highlighting areas that may require attention.

# create a new data frame that filters revWords to only include reviews with a star rating >= 4
topReviews <- revWords |> filter(stars >= 4)

# Display a wordcloud of the top 75 words with the highest tf_idf values in the topReviews (4 & 5 star rating) df

wordcloud(topReviews$word, topReviews$tf_idf, max.words = 75, random.order = FALSE, scale = c(2, .01), min.freq = 1)

# create a new data frame that filters revWords to only include reviews with a star rating <= 2
bottomReviews <- revWords |> filter(stars <= 2)

# Display a wordcloud of the top 75 words with the highest tf_idf values in the bottomReviews (1 & 2 star rating) df
wordcloud(bottomReviews$word, bottomReviews$tf_idf, max.words = 75, random.order = FALSE, scale = c(2, .01), min.freq = 1)

Custom Stop Words

I recommend removing the words buffet, food, dessert, eat, and Bellagio as custom stop words. These five words rank among the most frequently occurring in the dataset; buffet alone has an idf of about 0.143 in the output above, which corresponds to roughly 87% of reviews. They do not provide meaningful insights into customer sentiment, as they are expected to appear in most reviews regardless of whether the experience was positive or negative. Removing these words will allow future analyses to focus on terms that offer more actionable insights for MGM Resorts.
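
The chunk that defines `customStopwords` is not shown in this report. A minimal sketch of one way it could be defined follows, assuming a single column named `word` so that it can be passed to `anti_join()` the same way as tidytext's `stop_words`. The data frame `toyTokens` is hypothetical.

```r
library(dplyr)

# Assumed definition: one lower-case word per row in a column named `word`
customStopwords <- data.frame(
  word = c("buffet", "food", "dessert", "eat", "bellagio")
)

# Hypothetical tokens, already lower-cased as unnest_tokens() would produce
toyTokens <- data.frame(word = c("buffet", "crab", "buffets", "line", "food"))

# keep only the tokens that are not custom stop words
kept <- anti_join(toyTokens, customStopwords, by = "word")
kept$word
```

The plural “buffets” passes through the filter because the match is exact. When the filter runs before lemmatization, inflected forms like this would likely be reduced to “buffet” afterward, which is one possible explanation for buffet and dessert still appearing in the later word lists. Applying the custom filter after lemmatizing would be one way to catch them.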

Comparing MGM Grand to The Bellagio

Recalculating frequency statistics after separating the dataset by buffet allows us to identify words that are uniquely associated with each location. Some highly frequent words may be specific to a single business, reflecting aspects of service, quality, or experience that differ between the two buffets. This comparison helps MGM Resorts understand the distinct strengths and weaknesses of each buffet based on customer feedback.

# THE ENTIRE CONTENTS OF THIS CODE CHUNK SHOULD NOT APPEAR IN YOUR HTML OUTPUT REPORT
# BE SURE TO SPECIFY THE PROPER CHUNK OPTIONS SUCH THAT THIS CODE EXECUTES, BUT BOTH THE CODE AND OUTPUT ARE HIDDEN FROM THE PUBLISHED REPORT

# create a new df called bellagioRevs that filters/subsets the reviews df to keep only those where the business == "The Buffet at Bellagio"
bellagioRevs <- reviews |> filter(business == "The Buffet at Bellagio")

# create a new df called mgmRevs that filters/subsets the reviews df to keep only those where the business == "MGM Grand Buffet"
mgmRevs <- reviews |> filter(business == "MGM Grand Buffet")

# tokenize, remove stop words, remove custom stop words (as defined in the preceding chunk), lemmatize, and compute frequency statistics for the bellagioRevs df
bellagioWords <- bellagioRevs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(customStopwords) %>%
  mutate(word = lemmatize_words(word))%>%
  count(reviewID, word, sort=TRUE) %>% 
  bind_tf_idf(word, reviewID, n)

# calculate df for the bellagioWords data frame
docBellagioWords <-  bellagioWords |> 
  group_by(word) |> 
  summarise(df = n())

# merge number of documents (df) with bellagioWords 
bellagioWords <- left_join(bellagioWords, docBellagioWords, by = "word")
# merge all original metadata from "reviews" with bellagioWords
bellagioWords <- left_join(bellagioWords, reviews, by = "reviewID")

# tokenize, remove stop words, remove custom stop words (as defined in the preceding chunk), lemmatize, and compute frequency statistics for the mgmRevs df
mgmWords <- mgmRevs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  anti_join(customStopwords) %>%
  mutate(word = lemmatize_words(word))%>%
  count(reviewID, word, sort=TRUE) %>% 
  bind_tf_idf(word, reviewID, n)
  
  
# calculate df for the mgmWords data frame
docMgmWords <- mgmWords |> 
  group_by(word) |> 
  summarise(df = n())
  
# merge number of documents (df) with mgmWords 
mgmWords <- left_join(mgmWords, docMgmWords, by = "word")
# merge all original metadata from "reviews" with mgmWords
mgmWords <- left_join(mgmWords, reviews, by = "reviewID")

Word Usage Patterns Across the Two Buffets

Top 15 most used words in good reviews

This initial comparison of highly rated reviews reveals differences in what customers mention at each location. Both lists share words such as crab, selection, and price, so the contrast comes from the words unique to each buffet. MGM Grand Buffet’s unique words include breakfast, brunch, drink, and variety, suggesting that customers value its meal options and range of choices. Bellagio’s unique words include sushi, prime, rib, and worth, indicating that customers remember it for premium items such as prime rib. Bellagio’s list also contains line and wait; because these words appear in its low-rated reviews as well, frequency alone does not show whether those mentions are favorable.

mgmWords |> filter(stars >= 4) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |> 
  ggplot(aes(y=fct_reorder(word,n),x=n)) +
  geom_col()+
  labs(title = "MGM 15 most used words for good reviews", y = "word")

bellagioWords |> filter(stars >= 4) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |> 
  ggplot(aes(y=fct_reorder(word,n),x=n)) +
  geom_col() +
  labs(title = "Bellagio 15 most used words for good reviews", y = "word")

mgm10GoodWords <- mgmWords |> filter(stars >= 4) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |>  pull(word)

bellagio10GoodWords <- bellagioWords |> filter(stars >= 4) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |> pull(word)

Words in both lists

cat(intersect(mgm10GoodWords,bellagio10GoodWords),sep = " ,")

## dinner ,crab ,selection ,time ,leg ,dessert ,price ,buffet ,lunch

Words only in MGM

cat(setdiff(mgm10GoodWords,bellagio10GoodWords),sep = " ,")

## breakfast ,drink ,brunch ,star ,variety ,do

Words only in Bellagio

cat(setdiff(bellagio10GoodWords,mgm10GoodWords),sep = " ,")

## line ,wait ,rib ,sushi ,prime ,worth

Top 15 most used words in bad reviews

This comparison highlights where the two buffets’ negative reviews overlap and where they differ. Both lists share “price,” “pay,” “taste,” and “selection,” so complaints about cost and food appear at both locations. The words unique to MGM Grand Buffet (“breakfast,” “drink,” “dinner,” “eat”) point to specific meals rather than a single theme; whether taste and pay weigh more heavily on MGM’s ratings is examined in the word-level rating comparison later in this report.

For Bellagio, “line” and “wait” appear only in its list, which indicates that long lines and waiting times are a distinctive frustration for its guests. “Quality” is also unique to Bellagio’s list, so the word lists alone do not show that Bellagio’s food is better regarded. MGM Grand may still need to evaluate its pricing and food quality to better meet customer expectations.

mgm10BadWords <- mgmWords |> filter(stars <= 2) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |>  pull(word)

bellagio10BadWords <- bellagioWords |> filter(stars <= 2) |> group_by(word) |>  summarise(n = sum(n)) |> arrange(desc(n)) |> 
  head(15) |> pull(word)

Words in both lists

## price ,bad ,time ,crab ,didnt ,pay ,selection ,buffet ,taste ,leg

Words only in MGM

## breakfast ,drink ,grand ,dinner ,eat

Words only in Bellagio

## line ,wait ,sushi ,quality ,do

Words to look deeper into

We now select a set of words to examine more closely. These are the words that appeared in both buffets’ top 15 lists for highly rated reviews and also in both buffets’ top 15 lists for low-rated reviews. The words are shown below:

## crab ,selection ,time ,leg ,price ,buffet
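
One way to reproduce this list, assuming it was built as the overlap of the two “Words in both lists” outputs printed earlier, is a single `intersect()` call. The vector names below are hypothetical; the words are copied from those outputs.

```r
# words shared by both buffets in highly rated reviews (from the output above)
goodShared <- c("dinner", "crab", "selection", "time", "leg",
                "dessert", "price", "buffet", "lunch")

# words shared by both buffets in low-rated reviews (from the output above)
badShared <- c("price", "bad", "time", "crab", "didnt",
               "pay", "selection", "buffet", "taste", "leg")

# words that appear in all four top-15 lists
wordsToExamine <- intersect(goodShared, badShared)
cat(wordsToExamine, sep = " ,")
```

`intersect()` keeps the order of its first argument, which matches the order of the six words shown.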

Comparing these six words shows how star ratings differ between MGM and Bellagio among reviews that contain each word. On raw averages Bellagio comes out ahead because its overall average star rating is roughly 3.3, while MGM’s is roughly 2.7.

To correct for Bellagio’s average rating being higher than MGM’s, each business’s average rating is subtracted from the average rating of reviews containing a given word. The result shows how far reviews containing that word deviate from the average at their respective business.

For example, Bellagio reviews containing the word buffet averaged almost 0.2 stars above Bellagio’s 3.3 average, while MGM reviews containing buffet averaged roughly 0.5 stars above MGM’s 2.7 average. A review that includes buffet therefore tends to carry a higher-than-average star rating at both buffets, though the size of the effect differs between them.
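
The adjustment described above can be sketched on toy data. This is an illustration under stated assumptions, not the code used for the report: the data frame `toy`, its businesses “A” and “B”, and the `has_word` flag are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per review, flagged by whether it contains the word
toy <- data.frame(
  business = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stars    = c(5, 4, 2, 1, 4, 3, 2, 3),
  has_word = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# for each business: overall average, average among reviews with the word,
# and the difference between the two
deviation <- toy |>
  group_by(business) |>
  summarise(
    business_avg = mean(stars),
    word_avg     = mean(stars[has_word]),
    deviation    = word_avg - business_avg
  )

deviation
```

Subtracting each business’s own average puts both buffets on a common baseline of zero, so a positive deviation means reviews with the word rate above that buffet’s norm regardless of how the two overall averages compare.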

MGM has a greater advantage in reviews containing the words crab and leg, suggesting that its crab legs or leg of lamb are a standout for MGM.

The same deviation from each business’s average rating is shown for the words in MGM’s and Bellagio’s top 15 most frequent words in highly rated reviews. Both buffets see the greatest increase over their average rating when the words lunch and dessert appear. This suggests that guests view these buffets most favorably as lunch destinations. Price is an interesting case: it is among the top words in positive reviews, yet reviews that mention price rate lower than average for both businesses. This suggests that price is a selling point for some customers but a cause for a lower rating for more of them.

This graph depicts the same deviation from business average, but for the combination of MGM’s and Bellagio’s top 15 words in low-rated reviews. The word bad was the strongest indicator of a low-rated review, with a roughly equal effect on rating for both buffets. It is followed by taste and pay. For both buffets, reviews containing the words pay and taste had below-average ratings, and the effect is worse for MGM than for Bellagio. This suggests that MGM’s price-to-quality ratio is worse than Bellagio’s, potentially highlighting the underlying reason why its average rating is approximately 0.6 stars lower than Bellagio’s.

HW2 - Pre-processing and Frequency Analysis

Jad Shaheen (Vlad Tarashansky)

February 12, 2025