1. Executive Summary

Last spring semester, I studied abroad in Europe and had the chance to stay in a wide variety of accommodations. Since I traveled to new places almost every week, I accumulated extensive experience booking hotels. I always read reviews carefully before making a reservation—sometimes the hotel turned out to be much better than expected, sometimes worse, and sometimes the reviews felt spot on.

When travelers face countless hotel options, customer reviews are among the most trusted sources of information. However, numerical ratings alone often fail to capture guests’ real experiences and emotions, and reading through thousands of reviews manually is tiring, even before the trip begins.

Based on this personal experience, this project analyzes the 515K Hotel Reviews Data in Europe, a dataset containing 515,000 customer reviews and ratings for 1,493 luxury hotels across Europe. The project aims to explore the relationship between review text and numerical ratings, examine brand-specific patterns in review content, and build a simple keyword-based hotel recommendation model for a hypothetical traveler.

The analysis centers on three core themes:

1. Identify the various emotional expressions and review elements that characterize positive versus negative feedback.

2. Analyze and compare each hotel’s key attributes.

3. Leverage these insights to recommend the most suitable hotel for a hypothetical guest.

Together, these analyses are meant to uncover how emotional expressions in reviews correlate with numerical ratings, identify brand-level differences in customer feedback, and determine which aspects of review content are most useful when recommending hotels to travelers with specific preferences.

Analysis 1: Emotion vs. Reviewer Score

To begin with, I loaded the data.

# Load file
hotel_raw <- read.csv("Hotel_Reviews.csv")
head(hotel_raw)  # preview the first rows instead of printing all ~515K reviews

To explore the relationship between words and sentiment, I will analyze the Top 10 Words in Positive vs. Negative Reviews.

# 1) Load packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)

# 2) Create the 'reviews' dataset
reviews <- hotel_raw %>%
  select(Hotel_Name,
         Reviewer_Score,
         Positive_Review,
         Negative_Review)

# 3) Tokenize and remove stop words from Positive Reviews
pos_words <- reviews %>%
  filter(Positive_Review != "No Positive") %>%
  unnest_tokens(word, Positive_Review) %>%
  anti_join(stop_words, by = "word")

# 4) Tokenize and remove stop words from Negative Reviews
neg_words <- reviews %>%
  filter(Negative_Review != "No Negative") %>%
  unnest_tokens(word, Negative_Review) %>%
  anti_join(stop_words, by = "word")

# 5) Extract top 10 words for each sentiment
top_pos <- pos_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(sentiment = "Positive")

top_neg <- neg_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(sentiment = "Negative")

word_counts <- bind_rows(top_pos, top_neg)


# 6) Visualize the results
wc <- word_counts %>%
  mutate(word = reorder_within(word, n, sentiment))

ggplot(wc, aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal() +
  labs(
    title = "Top 10 Words in Positive vs Negative Reviews",
    y     = "Frequency"
  )

The results were somewhat unclear. Because each bar shows only a single word, it is difficult to tell why that word was associated with a positive or negative sentiment.

For example, words like friendly, helpful, nice, excellent, and comfortable clearly reflect positive sentiment, making them easy to interpret. On the other hand, words like night, shower, and service in the negative category are more ambiguous—these words alone don’t strongly indicate a negative review.

In other words, while we can infer that negative experiences related to night, shower, or service might lead to poor reviews, these words by themselves are not enough to determine whether a hotel received a low rating. The same logic applies to positive reviews.

To make the results more intuitive, I will first visualize them using a word cloud.

# Load packages
library(wordcloud)

## Loading required package: RColorBrewer

# Positive Reviews Wordcloud
pos_words %>% 
  count(word, sort = TRUE) %>% 
  with(
    wordcloud(
      word, n,
      max.words    = 100,
      scale        = c(4, 0.5),
      random.order = FALSE,
      colors       = "skyblue"
    )
  ) 
title("Positive Reviews")

# Negative Reviews Wordcloud
neg_words %>% 
  count(word, sort = TRUE) %>% 
  with(
    wordcloud(
      word, n,
      max.words    = 100,
      scale        = c(4, 0.5),
      random.order = FALSE,
      colors       = "pink"
    )
  )
title("Negative Reviews")

While the word cloud provides some visual insight, it still reflects the limitations of single-word analysis. To address this, I will shift the focus to analyzing multi-word expressions.

Looking at common phrases or word combinations in positive and negative reviews can provide a more meaningful and clearer understanding than analyzing individual words alone.

# Sample 50,000 positive and 50,000 negative reviews,
# because the full dataset is too large for bigram tokenization
set.seed(2025)  # fix the seed so the samples are reproducible

pos_sample <- reviews %>%
  filter(Positive_Review != "No Positive") %>%
  slice_sample(n = 50000)

neg_sample <- reviews %>%
  filter(Negative_Review != "No Negative") %>%
  slice_sample(n = 50000)

# 1) Extract bigrams from positive reviews and remove stop words
pos_bigrams <- pos_sample %>%
  unnest_tokens(bigram, Positive_Review, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>% 
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(sentiment = "Positive")

# 2) Extract bigrams from negative reviews and remove stop words
neg_bigrams <- neg_sample %>%
  unnest_tokens(bigram, Negative_Review, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigram)) %>% 
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(sentiment = "Negative")

# 3) Combine results for visualization
bigrams <- bind_rows(pos_bigrams, neg_bigrams)

# 4) Visualize
# reorder_within() orders bars inside each facet, even if a bigram appears in both
ggplot(bigrams, aes(x = reorder_within(bigram, n, sentiment), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(
    title = "Top 10 Bigrams in Sampled Positive vs Negative Reviews",
    x = NULL,
    y = "Frequency"
  ) +
  theme_minimal()

The analysis is now much clearer than before. In positive reviews, phrases like helpful staff, excellent location, comfortable bed, and perfect location clearly reflect positive sentiment, even without additional context.

However, for negative reviews, limitations still remain. Since the results are based on word frequency, context is often missing. For example, phrases like mini bar or hot water ideally would appear in contexts such as the mini bar was expensive or there was no hot water, but only the object itself is extracted, making interpretation difficult.
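To see why the context disappears, here is a minimal sketch on two made-up reviews (the toy text and the names toy, all_bigrams, and kept_bigrams are illustrative and not part of the dataset). It applies the same tokenize-then-filter steps as above: any bigram that contains a stop word is dropped.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two invented reviews, used only for illustration
toy <- tibble::tibble(
  review_id = 1:2,
  text      = c("Very helpful staff", "There was no hot water")
)

# Every overlapping pair of words becomes one bigram
all_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Keep only bigrams in which neither word is a stop word
kept_bigrams <- all_bigrams %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>%
  unite(bigram, w1, w2, sep = " ")

all_bigrams
kept_bigrams
```

In this toy case, the negation in "no hot water" is lost because "no" is on the stop-word list, which is the same effect seen in the negative bigrams.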

To overcome this, one option could be to analyze sentiment words directly. But that’s not the goal here. The purpose of this analysis isn’t to classify reviews simply because words like sad or nice appear.

Rather, the goal is to identify which words or phrases are more frequently used in positive or negative reviews, and from that, uncover the relationship between word usage and review sentiment. To achieve this, I will proceed with an analysis using log odds.
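Before applying it to the full data, the following sketch shows how an add-1 smoothed log odds ratio behaves on hypothetical counts. The function name smoothed_log_odds and all numbers are assumptions for illustration; the formula is the same one used in the code that follows.

```r
# Smoothed log odds ratio for one word (add-1 smoothing on every count)
smoothed_log_odds <- function(high, low, n_high, n_low) {
  odds_high <- (high + 1) / (n_high - high + 1)
  odds_low  <- (low  + 1) / (n_low  - low  + 1)
  log(odds_high / odds_low)
}

# Hypothetical totals: one million tokens in each rating group
n_high <- 1e6
n_low  <- 1e6

# A very common word that is slightly more frequent in high-rated reviews
common_word <- smoothed_log_odds(high = 50000, low = 45000, n_high, n_low)

# A rare word that happens to appear only in high-rated reviews
rare_word <- smoothed_log_odds(high = 30, low = 0, n_high, n_low)

c(common = common_word, rare = rare_word)
```

With these made-up counts, the rare word scores far higher than the common one even though it appears only 30 times. When rare words or typos dominate a ranking, a minimum-frequency filter may be worth considering.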

# 1) Combine review texts, tokenize words, remove stop words & filter words
tokens <- hotel_raw %>%
  mutate(text = str_c(Positive_Review, Negative_Review, sep = " ")) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by="word") %>%
  filter(str_detect(word, "^[a-z]+$"),
         str_length(word) > 2)

# 2) Tag reviews as High vs. Low rating groups
tokens <- tokens %>%
  mutate(group = if_else(Reviewer_Score >= 7, "High", "Low"))

# 3) Count word frequency by group
counts <- tokens %>%
  count(group, word) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0)

# 4) Calculate total word counts
N_high <- sum(counts$High)
N_low  <- sum(counts$Low)

# 5) Compute log odds ratio (with add-1 smoothing)
counts <- counts %>%
  mutate(
    odds_high = (High + 1) / (N_high - High + 1),
    odds_low  = (Low  + 1) / (N_low  - Low  + 1),
    log_odds  = log(odds_high / odds_low)
  )

# 6) Extract Top 10 words specific to High and Low rating groups
top_high <- counts %>%
  arrange(desc(log_odds)) %>%
  slice_head(n = 10) %>%
  mutate(direction = "High")

top_low <- counts %>%
  arrange(log_odds) %>%
  slice_head(n = 10) %>%
  mutate(direction = "Low")

top_words <- bind_rows(top_high, top_low)

# 7) Visualize
ggplot(top_words, aes(
    x = reorder(word, log_odds),
    y = log_odds,
    fill = direction
  )) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(High = "steelblue", Low = "tomato")) +
  coord_flip() +
  labs(
    title    = "Words Most Characteristic of High vs Low Ratings",
    x        = NULL,
    y        = "Log-Odds Ratio"
  ) +
  theme_minimal()

Unlike simple word frequency, the log odds ratio captures relative differences between the two groups, helping to identify the words that set them apart. Several interesting words that had not appeared in the earlier analyses emerged through this method. Because no minimum-frequency filter is applied, some of them appear to be rare words or misspellings (for example, worest).

The results are as follows:

[Positive] flawless, niggle, heartbeat, oasis, criticise, exceeded, fountains, unobtrusive, thoughtful

[Negative] nock, fitty, enemy, beetle, liars, vermin, rud, lawyer, worest, hutch

However, some ambiguous words still remain.

For instance, it’s unclear why niggle—a word typically associated with negative sentiment—appears in positive reviews, or why hutch, which seems unrelated, is found in negative ones. These cases warrant closer inspection.

To investigate further, I will examine three examples each from positive and negative reviews that include these words to better understand their context.

For this, I used the regex pattern regex("\\bniggle\\b"), as suggested by ChatGPT.
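As a quick check of what the word boundaries do, the sketch below runs the same pattern on three made-up strings (the examples vector is hypothetical). The \\b anchors require the match to be a whole word, so "niggles" is not matched, and ignore_case = TRUE lets "Niggle" match.

```r
library(stringr)

# Invented strings, used only to illustrate word-boundary matching
examples <- c("one small niggle", "a few niggles", "Niggle aside, great stay")

pattern <- regex("\\bniggle\\b", ignore_case = TRUE)
matches <- str_detect(examples, pattern)

matches
# TRUE FALSE TRUE
```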

# Extract sentences containing "niggle" from positive reviews
niggle_sentences <- hotel_raw %>%
  filter(Positive_Review != "No Positive") %>%
  unnest_tokens(sentence, Positive_Review, token = "sentences") %>%
  filter(str_detect(sentence, regex("\\bniggle\\b", ignore_case = TRUE))) %>%
  slice_head(n = 3) %>%
  pull(sentence)

# Extract sentences containing "hutch" from negative reviews
hutch_sentences <- hotel_raw %>%
  filter(Negative_Review != "No Negative") %>%
  unnest_tokens(sentence, Negative_Review, token = "sentences") %>%
  filter(str_detect(sentence, regex("\\bhutch\\b", ignore_case = TRUE))) %>%
  slice_head(n = 3) %>%
  pull(sentence)

# Check the results  
niggle_sentences

## [1] "everything was amazing staff very friendly and extremely helpful location and appearance of hotel was exceptional expensive option but you get what you pay for one small niggle which i will be speaking to the hotel about but other than that would be my first choice when visiting the o2 in future"
## [2] "check in was brilliant the room was beautiful although could have done with a fan as no air conditioning breakfast was extensive only downside was that the milk that we asked for for coffee in the room never arrived small niggle"
## [3] "nice touch with the cookies on arrival and whenever you fancied one we were staying for 3 nights and opted not to take the housekeeping every day as offered so ran out of tea bags and milk which was the only niggle we had to ask for them at reception staff were all very helpful and cheerful throughout our stay very comfortable bed breakfast choice was excellent and catered for all dietary needs"

hutch_sentences

## [1] "our first night put in a rabbit hutch of a room unable to close door properly because bed was in the way the bathroom door also touched the bed the bathroom was just a small wet room no extractor fan took shower room full of water mirrors frosted over impossible to clear hotel obviously taking advantage of customers using this room outrageous to claim it s a double room 219"
## [2] "wifi doesn t work tv remote doesn t work room is the size of a rabbit hutch"
## [3] "the room was appalling forget the pictures on the website the twin room was so small you couldn t swing a cat and housed in the wing accessed via two lifts and a route march all the hallmarks of a larger room that had been split into a smaller one to sweat the asset curtains fitted into the window recess and not the outside result light flooding in at any time this is a capital city hotel for goodness sake the lights never dim outside no room between the beds and very little down the sides a standard lamp in one corner that you walked into every time you walked around the beds an office chair that couldn t come out from under the desk because the foot of the bed was too close you entered the room and walked straight into the side of the wardrobe full entry required negotiating a dog leg movement a mini fridge housed in the wardrobe that couldn t be accessed unless you opened both wardrobe doors a life expired tv an en suite so small that the sink was an arm rest when sat on the lavatory the hotel was booked because my daughter was undertaking a theatre experience in the west end and in particular the apollo victoria theatre whilst the theatre experience was wonderful this room really took the shine off what an end to a hard working day sleeping in a rabbit hutch"

In the positive reviews, niggle appeared in contexts like:

[1] small niggle

[2] small niggle

[3] the only niggle

This indicates that reviewers used the word to express minor complaints, often as a small caveat after an overall positive experience.

In contrast, in the negative reviews, hutch appeared in sentences such as:

[1] a rabbit hutch of a room unable to close door

[2] room is the size of a rabbit hutch

[3] a hard working day sleeping in a rabbit hutch

Here, the word was used to criticize the room size—comparing it to a cramped rabbit hutch.

Considering the context, the usage of both words becomes much clearer.

With this understanding, we now move on to the log odds of bigrams in high- vs. low-rated reviews.

# 1) Combine review texts 
df <- hotel_raw %>%
  mutate(text = str_c(Positive_Review, Negative_Review, sep = " "))

# 2) Tokenize into bigrams and remove stop words
bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("w1","w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word,
         str_detect(w1, "^[a-z]+$"),
         str_detect(w2, "^[a-z]+$")) %>%
  unite(bigram, w1, w2, sep = " ")

# 3) Tag as High/Low rating groups 
bigrams <- bigrams %>%
  mutate(group = if_else(Reviewer_Score >= 7, "High", "Low"))

# 4) Count bigram frequency by group 
counts_ng <- bigrams %>%
  count(group, bigram) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0)

# 5) Calculate total bigram count
N_high <- sum(counts_ng$High)
N_low  <- sum(counts_ng$Low)

# 6) Compute log odds  
counts_ng <- counts_ng %>%
  mutate(
    odds_high = (High + 1)/(N_high - High + 1),
    odds_low  = (Low  + 1)/(N_low  - Low  + 1),
    log_odds  = log(odds_high/odds_low)
  )

# 7) Extract Top 10 bigrams for High / Low groups 
top_high_bg <- counts_ng %>%
  arrange(desc(log_odds)) %>%
  slice_head(n = 10) %>%
  mutate(direction = "High")

top_low_bg <- counts_ng %>%
  arrange(log_odds) %>%
  slice_head(n = 10) %>%
  mutate(direction = "Low")

top_bigrams <- bind_rows(top_high_bg, top_low_bg)

# 8) Visualize the results
library(ggplot2)
library(forcats)

ggplot(top_bigrams, aes(x = reorder(bigram, log_odds), y = log_odds, fill = direction)) +
  geom_col(show.legend=FALSE) +
  coord_flip() +
  scale_fill_manual(values = c(High="steelblue", Low="tomato")) +
  labs(
    title = "Top Bigrams Characteristic of High vs Low Ratings",
    x = NULL, y = "Log-Odds (High/Low)"
  ) +
  theme_minimal()

This analysis yielded the most definitive results so far.

The bigrams associated with positive reviews were clearly positive, while those linked to negative reviews strongly reflected dissatisfaction.

Each bigram was clearly grounded in context, making it easy to understand how it was used and what led the reviewer to leave that feedback.

Analysis 2: Brand

This CSV file contains a wide range of hotels. To make the analysis more applicable to real-life situations, I will create fictional customer personas and match them with the most suitable hotels for personalized recommendations.

As a first step, I will extract only global hotel brands and examine whether there are meaningful differences between them. Using TF–IDF analysis, I aim to uncover the unique features and characteristics associated with each brand.
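As a reminder of what TF–IDF rewards, here is a minimal sketch on hypothetical word counts for two made-up brands (the brand names, words, and counts are all invented for illustration). A word that appears in every document gets an IDF of zero, while a word unique to one document gets a positive score.

```r
library(dplyr)
library(tidytext)

# Invented counts: "staff" appears for both brands, the other words for one only
toy_counts <- tibble::tribble(
  ~brand,    ~word,   ~n,
  "Brand A", "staff", 10,
  "Brand A", "tram",   5,
  "Brand B", "staff", 12,
  "Brand B", "pool",   3
)

# tf = share of the brand's words; idf = ln(number of brands / brands using the word)
toy_tfidf <- toy_counts %>%
  bind_tf_idf(term = word, document = brand, n = n) %>%
  arrange(desc(tf_idf))

toy_tfidf
```

In this toy table, "staff" scores zero because both brands use it, so only "tram" and "pool" stand out as brand-specific terms.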

# 0) Load packages  
library(tidyverse)
library(knitr)

# 1) Define global brand list  
global_brands <- c(
  "andaz", "autograph", "conrad", "courtyard", "crowne plaza", "doubletree",
  "embassy suites", "fairmont", "four points", "four seasons", "grand hyatt",
  "hampton", "hilton", "holiday inn", "holiday inn express", "hotel indigo",
  "hyatt", "hyatt place", "ibis", "intercontinental", "jw marriott", "le meridien",
  "mandarin oriental", "marriott", "mercure", "novotel", "park hyatt", "park inn",
  "pullman", "radisson blu", "sheraton"
)

# 2) Create brand-to-segment mapping table 
brand_segment_map <- tribble(
  ~Brand,               ~Segment,
  "Ibis",               "Economy",
  "Hampton",            "Economy",
  "Holiday Inn Express","Economy",
  "Park Inn",           "Economy",
  
  "Holiday Inn",        "Midscale",
  "Courtyard",          "Midscale",
  "Four Points",        "Midscale",
  "Novotel",            "Midscale",
  "Mercure",            "Midscale",
  "Hyatt Place",        "Midscale",
  
  "Doubletree",         "Upscale",
  "Hilton",             "Upscale",
  "Sheraton",           "Upscale",
  "Radisson Blu",       "Upscale",
  "Crowne Plaza",       "Upscale",
  "Pullman",            "Upscale",
  "Hotel Indigo",       "Upscale",
  "Autograph Collection","Upscale",
  "Autograph",            "Upscale",
  "Hyatt",                "Upscale",
  
  "Marriott",           "Luxury",
  "Intercontinental",   "Luxury",
  "Jw Marriott",        "Luxury",
  "Grand Hyatt",        "Luxury",
  "Park Hyatt",         "Luxury",
  "Le Meridien",        "Luxury",
  "Four Seasons",       "Luxury",
  "Fairmont",           "Luxury",
  "Mandarin Oriental",  "Luxury",
  "Andaz",              "Luxury",
  "Conrad",               "Luxury"
)

# 3) Filter only global brands  
hotel_raw <- read.csv("Hotel_Reviews.csv")

global_hotel <- hotel_raw %>%
  mutate(hotel_lower = str_to_lower(Hotel_Name)) %>%
  filter(str_detect(hotel_lower, str_c(global_brands, collapse = "|")))

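# 4) Extract actual brand names from filtered rows
# Sort brands longest-first: regex alternation takes the first alternative that matches,
# so "holiday inn" would otherwise shadow "holiday inn express" (and "hyatt" would shadow "hyatt place")
brands_sorted <- global_brands[order(nchar(global_brands), decreasing = TRUE)]
pattern <- str_c("\\b(", str_c(brands_sorted, collapse = "|"), ")\\b")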
global_hotel <- global_hotel %>%
  mutate(
    brand_lower = str_extract(hotel_lower, regex(pattern, ignore_case = TRUE)),
    Brand       = if_else(!is.na(brand_lower),
                          str_to_title(brand_lower),
                          "Other")
  )

# 5) Map Brand to Segment  
global_hotel <- global_hotel %>%
  left_join(brand_segment_map, by = "Brand") %>%
  replace_na(list(Segment = "Other"))

# 6) Clean up final columns and display the table  
global_hotel %>%
  select(Hotel_Name, Brand, Segment, Reviewer_Score,Positive_Review, Negative_Review)
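When brand names overlap (holiday inn vs. holiday inn express, hyatt vs. hyatt place), the order of the alternatives in the regex decides which brand is extracted. The sketch below illustrates this on a made-up hotel name; the variable names and the longest-first sorting step are assumptions shown for illustration.

```r
library(stringr)

# Invented hotel name, already lower-cased
hotel <- "holiday inn express london city"

# Shorter brand listed first: the longer alternative is never reached
short_first <- str_extract(hotel, "\\b(holiday inn|holiday inn express)\\b")

# Sorting brands from longest to shortest lets the more specific brand win
brands     <- c("holiday inn", "holiday inn express")
ordered    <- brands[order(nchar(brands), decreasing = TRUE)]
long_first <- str_extract(hotel, str_c("\\b(", str_c(ordered, collapse = "|"), ")\\b"))

short_first
# "holiday inn"
long_first
# "holiday inn express"
```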

After grouping the hotels by brand and categorizing the brands based on price, I will count the number of hotels in each group. This will help manage the scope of the analysis and make the classification results clearer.

First, I’ll count how many hotels belong to each brand—since the same brand, like Hyatt, can have multiple locations across different cities.

Next, I’ll check how many brands fall into each price segment.

# 1) Count number of hotels per brand  
brand_hotel_types <- global_hotel %>%
  group_by(Brand) %>%
  summarise(hotel_types = n_distinct(Hotel_Name)) %>%
  arrange(desc(hotel_types))
brand_hotel_types

# 2) List the hotel names under each brand
brand_hotel_list <- global_hotel %>%
  distinct(Brand, Hotel_Name) %>%
  arrange(Brand, Hotel_Name)
brand_hotel_list

# 3) Count number of brands per segment  
segment_brand_counts <- global_hotel %>%
  select(Brand, Segment) %>%
  distinct() %>%                   # remove duplicate Brand–Segment combinations
  count(Segment, name = "brand_count") %>%
  arrange(desc(brand_count))
segment_brand_counts

Analysis 3: Recommendations

This persona was created by randomly combining various accommodation-related preferences. Here’s the profile:

🧑‍💼 Name: Seohyun Lee (34 years old, UX Designer, based in Seoul)

🔍 Travel Purpose

A solo trip across Europe for relaxation and personal renewal. After being worn out from a fast-paced work life, Seohyun is taking time for herself by visiting several European cities.

✅ Core Values

She values walkability, convenient access to public transportation, quiet and peaceful rooms, canal views, and an overall emotional ambiance. While she hopes to feel completely relaxed and at ease in her accommodations, she also enjoys exploring the city and experiencing local culture outside.

🏙️ Accommodation Preferences

- Close proximity to a tram or metro station
- A room with a canal view
- A safe and quiet atmosphere, ideal for solo travelers
- A spacious, clean bathroom with strong water pressure
- Small thoughtful touches like complimentary water or tea/coffee
- Friendly front desk staff and a smooth check-in process

🙅‍♀️ Sensitive to

- Noise from ongoing renovations or outdated facilities
- Areas crowded with heavy tourist traffic

🍽️ Looking for Nearby

- Local restaurants, especially those with vegetarian options
- Quiet cafés
- Walkable neighborhoods with a relaxed atmosphere

She prioritizes feeling like she’s living the city’s everyday life, rather than simply checking off tourist attractions.

✳ World of Hyatt Member

She previously had a great experience at a Hyatt Regency and would love to stay at another one. She’s now looking to find which Hyatt Regency in Europe would best suit her preferences.

# 1) Filter hotels containing “hyatt regency” and combine positive and negative reviews into one text column
regency_df <- hotel_raw %>%
  filter(str_detect(tolower(Hotel_Name), "hyatt regency")) %>%
  mutate(full_text = str_c(Positive_Review, Negative_Review, sep = " "))

# 2) Tokenize text into individual words and remove stop words
# To keep place names (e.g., cities) from dominating the TF–IDF results, remove them as custom stop words
custom_stops <- tibble(word = c("paris", "amsterdam", "london", "churchill"))

regency_tokens <- regency_df %>%
  unnest_tokens(word, full_text) %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stops, by="word") %>% 
  filter(str_detect(word, "^[a-z]+$")) %>% 
  filter(str_length(word) > 2) 

# 3) Count word frequency for each hotel
regency_word_counts <- regency_tokens %>%
  count(Hotel_Name, word, sort = TRUE)

# 4) Calculate TF–IDF values

regency_tfidf <- regency_word_counts %>%
  bind_tf_idf(term = word, document = Hotel_Name, n = n)

# 5) Extract the top 10 TF–IDF terms for each hotel
top_regency <- regency_tfidf %>%
  group_by(Hotel_Name) %>%
  slice_max(order_by = tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup()

# 6) Display the results in a table
top_regency %>%
  select(Hotel_Name, word, tf_idf) %>%
  arrange(Hotel_Name, desc(tf_idf)) %>%
  print(n = 30)

## # A tibble: 30 × 3
##    Hotel_Name                         word           tf_idf
##    <chr>                              <chr>           <dbl>
##  1 Hyatt Regency Amsterdam            tram          0.00515
##  2 Hyatt Regency Amsterdam            neighbourhood 0.00309
##  3 Hyatt Regency Amsterdam            centre        0.00304
##  4 Hyatt Regency Amsterdam            brand         0.00228
##  5 Hyatt Regency Amsterdam            metro         0.00228
##  6 Hyatt Regency Amsterdam            bfast         0.00206
##  7 Hyatt Regency Amsterdam            bother        0.00206
##  8 Hyatt Regency Amsterdam            canal         0.00206
##  9 Hyatt Regency Amsterdam            fire          0.00206
## 10 Hyatt Regency Amsterdam            gem           0.00206
## 11 Hyatt Regency London The Churchill size          0.00336
## 12 Hyatt Regency London The Churchill oxford        0.00303
## 13 Hyatt Regency London The Churchill water         0.00240
## 14 Hyatt Regency London The Churchill late          0.00144
## 15 Hyatt Regency London The Churchill appointed     0.00130
## 16 Hyatt Regency London The Churchill blah          0.00130
## 17 Hyatt Regency London The Churchill portman       0.00130
## 18 Hyatt Regency London The Churchill square        0.00130
## 19 Hyatt Regency London The Churchill washing       0.00130
## 20 Hyatt Regency London The Churchill bathrooms     0.00128
## 21 Hyatt Regency Paris Etoile         tower         0.0102 
## 22 Hyatt Regency Paris Etoile         eiffel        0.00863
## 23 Hyatt Regency Paris Etoile         renovation    0.00358
## 24 Hyatt Regency Paris Etoile         metro         0.00241
## 25 Hyatt Regency Paris Etoile         dated         0.00221
## 26 Hyatt Regency Paris Etoile         water         0.00220
## 27 Hyatt Regency Paris Etoile         broken        0.00182
## 28 Hyatt Regency Paris Etoile         refurbishment 0.00182
## 29 Hyatt Regency Paris Etoile         arc           0.00173
## 30 Hyatt Regency Paris Etoile         outdated      0.00163

# 7) Visualize the results
library(ggplot2)
library(tidytext)  # for reorder_within, scale_x_reordered
ggplot(top_regency, aes(
    x    = reorder_within(word, tf_idf, Hotel_Name),
    y    = tf_idf,
    fill = Hotel_Name
  )) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Hotel_Name, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(
    title = "Top 10 TF–IDF Terms for Hyatt Regency Hotels",
    x     = NULL,
    y     = "TF–IDF"
  ) +
  theme_minimal()
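One way to connect these TF–IDF terms back to the persona is a simple keyword score. The sketch below is only an illustration: the wanted and avoided keyword lists are assumptions drawn loosely from the persona description, and the terms are copied from the table above without their scores.

```r
library(dplyr)

# Top 10 TF-IDF terms per hotel, copied from the table above
top_terms <- tibble::tibble(
  Hotel_Name = rep(c("Hyatt Regency Amsterdam",
                     "Hyatt Regency London The Churchill",
                     "Hyatt Regency Paris Etoile"), each = 10),
  word = c(
    "tram", "neighbourhood", "centre", "brand", "metro",
    "bfast", "bother", "canal", "fire", "gem",
    "size", "oxford", "water", "late", "appointed",
    "blah", "portman", "square", "washing", "bathrooms",
    "tower", "eiffel", "renovation", "metro", "dated",
    "water", "broken", "refurbishment", "arc", "outdated"
  )
)

# Hypothetical keyword lists based on the persona's preferences and sensitivities
wanted  <- c("tram", "metro", "canal", "quiet", "neighbourhood")
avoided <- c("renovation", "refurbishment", "dated", "outdated", "noise", "crowded")

# Score = number of wanted keywords minus number of avoided keywords
persona_scores <- top_terms %>%
  group_by(Hotel_Name) %>%
  summarise(
    wanted_hits  = sum(word %in% wanted),
    avoided_hits = sum(word %in% avoided),
    score        = wanted_hits - avoided_hits,
    .groups      = "drop"
  ) %>%
  arrange(desc(score))

persona_scores
```

Under these assumed keyword lists, Hyatt Regency Amsterdam ranks first. A term appearing among the top TF–IDF words does not by itself say whether reviewers mentioned it positively or negatively, so the score is best read alongside the reviews themselves.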

library(dplyr); library(tidytext); library(widyr)
library(tidygraph); library(ggraph)

## 
## Attaching package: 'tidygraph'

## The following object is masked from 'package:stats':
## 
##     filter

# 1) Extract reviews for Hyatt Regency Amsterdam and generate tokens
tokens_amsterdam <- regency_df %>%
  filter(Hotel_Name == "Hyatt Regency Amsterdam") %>%
  tibble::rowid_to_column("review_id") %>%
  unnest_tokens(word, full_text) %>%
  anti_join(stop_words,   by = "word") %>%
  anti_join(custom_stops, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$"), str_length(word) > 2)

# 2) Identify the top 10 TF–IDF terms
Amsterdam_terms <- top_regency %>%
  filter(Hotel_Name == "Hyatt Regency Amsterdam") %>%
  pull(word)

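# 3) Calculate pairwise word co-occurrence
# upper = FALSE keeps each pair once instead of as both (a, b) and (b, a),
# which would otherwise draw every edge twice in the undirected graph
pairs_all <- tokens_amsterdam %>%
  pairwise_count(item = word, feature = review_id, sort = TRUE, upper = FALSE)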

# 4) Filter only pairs associated with the top terms
network_edges <- pairs_all %>%
  filter((item1 %in% Amsterdam_terms | item2 %in% Amsterdam_terms) & n >= 2)

# 5) Format the data into nodes and edges
nodes <- tibble(name = unique(c(network_edges$item1, network_edges$item2)))
edges <- network_edges %>%
  rename(from = item1, to = item2)

# 6) Create a tbl_graph object and visualize the network
graph_full <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

set.seed(2025)
ggraph(graph_full, layout = "fr") +
  geom_edge_link(aes(width = n), color = "gray70") +
  geom_node_point(aes(filter = name %in% Amsterdam_terms),
                  color = "tomato", size = 6) +
  geom_node_point(aes(filter = !(name %in% Amsterdam_terms)),
                  color = "steelblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_edge_width(range = c(0.5, 2)) +
  labs(title = "Co-occurrence Network for Hyatt Regency Amsterdam") +
  theme_void()
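To make the co-occurrence counts behind this network concrete, here is a minimal sketch on two made-up reviews (the toy_tokens table is invented for illustration). Two words co-occur when they appear in the same review, and n is the number of reviews in which that happens.

```r
library(dplyr)
library(widyr)

# Invented tokens: review 1 mentions tram and metro; review 2 adds canal
toy_tokens <- tibble::tribble(
  ~review_id, ~word,
  1,          "tram",
  1,          "metro",
  2,          "tram",
  2,          "metro",
  2,          "canal"
)

# upper = FALSE returns each unordered pair once
toy_pairs <- toy_tokens %>%
  pairwise_count(item = word, feature = review_id, sort = TRUE, upper = FALSE)

toy_pairs
```

Here tram and metro share two reviews, so that pair gets the heaviest edge, while each pair involving canal is counted once.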

#🏙️ 1. Prime Central Location

The centre node sits at the heart of the network, frequently linked with hotel, location, city, quiet, and walking. The hotel’s location in central Amsterdam allows for easy walking access to nearly all major attractions—a key strength.

#🚇 2. Excellent Transportation Access

The metro and tram nodes act as hubs, connected to distance, walking, public, convenient, times, and day. Reviewers consistently highlight how close the hotel is to metro and tram stations, making tourist spots and airport travel incredibly convenient.

#🌊 3. Charm of the Canals

The canal node connects with view and friendly. Rooms with canal views paired with warm service offer a special experience from the moment the day begins.

#🍽️ 4. Vibrant Neighborhood

The neighbourhood node links to restaurant, food, modern, and stops. The area around the hotel is filled with modern eateries, local bars, and diverse cultural spaces.

⭐ 5. Brand Trust & Hidden Gem Appeal

The brand node clusters with staff, comfortable, spacious, and hotel, reflecting Hyatt’s signature comfort and service. The gem node connects with nice, suggesting the hotel is seen as a “hidden gem” by many guests.

🔍 6. Minor Drawbacks

The bother node is connected to close and quiet. A few reviews mention minor inconveniences like occasional noise or delays at check-in, though such feedback is rare compared to the overwhelmingly positive themes.

✨ Summary

Hyatt Regency Amsterdam is frequently praised as a “hidden gem in the heart of the city,” offering:

- A truly central location
- Outstanding transportation connectivity
- Romantic canal views
- A walkable neighborhood that reflects local life
- Hyatt’s signature comfort and service, all in perfect balance
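The hub observations above (for example, that centre, metro, and tram sit at the heart of the network) were read off the plot. They could also be checked numerically by ranking nodes by degree. The following is a minimal sketch, assuming an edge table shaped like `edges` above; the toy edge list and its counts are illustrative only and are not results from the dataset.

```r
library(dplyr)
library(tibble)
library(tidygraph)

# Toy edge list standing in for `edges` (illustrative counts, not real results)
toy_edges <- tribble(
  ~from,    ~to,        ~n,
  "centre", "location",  5,
  "centre", "walking",   4,
  "centre", "quiet",     3,
  "metro",  "walking",   3,
  "metro",  "tram",      2
)

# Build an undirected graph; node names are taken from the from/to columns
toy_graph <- as_tbl_graph(toy_edges, directed = FALSE)

# Rank nodes by degree, i.e. the number of edges attached to each node
node_ranking <- toy_graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree()) %>%
  as_tibble() %>%
  arrange(desc(degree))

print(node_ranking)
```

Applied to `graph_full`, the same pipeline would list the best-connected words first, which gives a quick check on which nodes really act as hubs.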

library(dplyr)
library(tidytext)
library(widyr)
library(tidygraph)
library(ggraph)
library(stringr)  # provides str_detect() and str_length() used below
# Same method as in the previous code
# 1) Extract reviews for Hyatt Regency London The Churchill and generate tokens
tokens_churchill <- regency_df %>%
  filter(Hotel_Name == "Hyatt Regency London The Churchill") %>%
  tibble::rowid_to_column("review_id") %>%
  unnest_tokens(word, full_text) %>%
  anti_join(stop_words,   by = "word") %>%
  anti_join(custom_stops, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$"), str_length(word) > 2)

Churchill_terms <- top_regency %>%
  filter(Hotel_Name == "Hyatt Regency London The Churchill") %>%
  pull(word)

pairs_churchill <- tokens_churchill %>%
  pairwise_count(item    = word,
                 feature = review_id,
                 sort    = TRUE)

network_edges_churchill <- pairs_churchill %>%
  filter((item1 %in% Churchill_terms | item2 %in% Churchill_terms) & n >= 2)

nodes_churchill <- tibble(name = unique(c(network_edges_churchill$item1,
                                          network_edges_churchill$item2)))
edges_churchill <- network_edges_churchill %>%
  rename(from = item1, to = item2)
graph_churchill <- tbl_graph(nodes = nodes_churchill,
                             edges = edges_churchill,
                             directed = FALSE)

set.seed(2025)
ggraph(graph_churchill, layout = "fr") +
  geom_edge_link(aes(width = n), color = "gray70") +
  geom_node_point(aes(filter = name %in% Churchill_terms),
                  color = "tomato", size = 6) +
  geom_node_point(aes(filter = !(name %in% Churchill_terms)),
                  color = "steelblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_edge_width(range = c(0.5, 2)) +
  labs(title = "Co-occurrence Network for Hyatt Regency London The Churchill") +
  theme_void()
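Because the same steps are repeated for each hotel, the pair counting and filtering could be wrapped in one helper. The following is a minimal sketch, assuming a token table with `review_id` and `word` columns as above; the function name `build_edges` and the toy tokens are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(tibble)
library(widyr)

# Hypothetical helper: count word pairs per review, then keep only pairs that
# involve a key term and co-occur in at least `min_n` reviews
build_edges <- function(tokens, key_terms, min_n = 2) {
  tokens %>%
    pairwise_count(item = word, feature = review_id, sort = TRUE) %>%
    filter((item1 %in% key_terms | item2 %in% key_terms) & n >= min_n) %>%
    rename(from = item1, to = item2)
}

# Toy tokens (illustrative only): "late" and "check" share two reviews
toy_tokens <- tibble(
  review_id = c(1, 1, 1, 2, 2, 3, 3),
  word      = c("late", "check", "staff", "late", "check", "size", "king")
)

toy_result <- build_edges(toy_tokens, key_terms = c("late", "size"), min_n = 2)
print(toy_result)
```

Note that `pairwise_count()` returns each pair in both orders by default, so the late/check pair appears twice in the result, just as it does in the edge tables built above.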

🛋️ 1. Spacious Rooms

The size node has the highest number of connections—linked to terms like amenities, pillows, king, comfort, and excellent (16+ words in total). → Frequent mentions of king-sized beds, ample space, and thoughtful amenities highlight the room’s comfort and size as standout features.

⏰ 2. Flexible Check-in & Check-out

The late node connects with check, requested, times, staff, helpful, and stay. → Guests often appreciated the staff’s prompt response to late check-in or check-out requests.

📍 3. Prime Location

The portman node links with square, street, view, hotels, and card, emphasizing its proximity to Portman Square. The oxford node connects with street, staying, told, and perfect, highlighting easy access to Oxford Street and a “perfect location”.

🍳 4. Breakfast Experience

The breakfast node is linked with friendly, service, clean, bathrooms, elevator, floor, keeping, slow, and professional. → While many noted the friendly service, some also mentioned crowded dining areas and slower service during breakfast.

🚿 5. Bathroom Facilities

The appointed node connects with toilet. → Reviews praised the “well-appointed” bathroom, suggesting satisfaction with room facilities and cleanliness.

💧 6. Water & Housekeeping

The water node is linked with words like free, daily, towels, bottles, housekeeping, washed, cleaned, housekeepers, rude, called, fast, comfy, and day. → Guests commented positively on daily complimentary water, fresh towels, and the speed of housekeeping, although the link to rude suggests that a few reviews complained about staff demeanor.

✨ Summary

Reviews of Hyatt Regency London The Churchill consistently highlight:

- Spacious rooms and thoughtful amenities (size)
- Flexible check-in/check-out experiences (late)
- A prime central location near Portman Square and Oxford Street (portman, oxford)
- A friendly but occasionally busy breakfast service (breakfast)
- Well-maintained bathrooms (appointed)
- Daily housekeeping with complimentary water and towels (water)

These elements position the hotel as a refined and comfortable stay in central London.

library(dplyr)
library(tidytext)
library(widyr)
library(tidygraph)
library(ggraph)
library(stringr)  # provides str_detect() and str_length() used below
# Same method as in the previous code
# 1) Extract reviews for Hyatt Regency Paris Etoile and generate tokens
tokens_paris <- regency_df %>%
  filter(Hotel_Name == "Hyatt Regency Paris Etoile") %>%
  tibble::rowid_to_column("review_id") %>%
  unnest_tokens(word, full_text) %>%
  anti_join(stop_words,   by = "word") %>%
  anti_join(custom_stops, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$"), str_length(word) > 2)

Paris_terms <- top_regency %>%
  filter(Hotel_Name == "Hyatt Regency Paris Etoile") %>%
  pull(word)

pairs_paris <- tokens_paris %>%
  pairwise_count(item    = word,
                 feature = review_id,
                 sort    = TRUE)

network_edges_paris <- pairs_paris %>%
  filter((item1 %in% Paris_terms | item2 %in% Paris_terms) & n >= 7)
# The network was too cluttered at n >= 2, so the threshold is raised to 7 for this hotel only

nodes_paris <- tibble(name = unique(c(network_edges_paris$item1,
                                      network_edges_paris$item2)))
edges_paris <- network_edges_paris %>%
  rename(from = item1, to = item2)

graph_paris <- tbl_graph(nodes = nodes_paris,
                         edges = edges_paris,
                         directed = FALSE)

set.seed(2025)
ggraph(graph_paris, layout = "fr") +
  geom_edge_link(aes(width = n), color = "gray70") +
  geom_node_point(aes(filter = name %in% Paris_terms),
                  color = "tomato", size = 6) +
  geom_node_point(aes(filter = !(name %in% Paris_terms)),
                  color = "steelblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_edge_width(range = c(0.5, 2)) +
  labs(title = "Co-occurrence Network for Hyatt Regency Paris Etoile") +
  theme_void()
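The threshold of 7 used for Paris was chosen by eye. One way to make that choice more transparent is to count how many edges survive at several candidate thresholds before plotting. The following is a minimal sketch, assuming a pair table with an `n` column like `pairs_paris`; the toy pairs and their counts are illustrative only.

```r
library(tibble)

# Toy pair counts standing in for `pairs_paris` (illustrative values only)
toy_pairs <- tibble(
  item1 = c("metro", "metro", "eiffel", "water", "renovation"),
  item2 = c("station", "bus", "view", "free", "noise"),
  n     = c(12, 7, 9, 3, 2)
)

# Candidate minimum co-occurrence counts to compare
thresholds <- c(2, 5, 7, 10)

# For each threshold, count the pairs that would remain in the network
edge_counts <- tibble(
  min_n      = thresholds,
  edges_kept = sapply(thresholds, function(k) sum(toy_pairs$n >= k))
)

print(edge_counts)
```

A threshold could then be picked where the number of remaining edges becomes small enough to read, rather than by trial and error on the plot.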

🚇 1. Excellent Transportation Access

The metro node has the highest number of connections—linked to bus, station, airport, access, walk, line, shopping, mall, views, and easy. → The hotel’s location right in front of a metro station is one of the most frequently mentioned advantages.

🗼 2. Proximity to Eiffel Tower & Arc de Triomphe

The eiffel node connects with arc, location, view, hotel, staff, helpful, friendly, and clean. The arc node links with eiffel, location, view, staff, helpful, and hotel. → Positioned between the Eiffel Tower and Arc de Triomphe, the hotel is praised for its stunning views and exceptionally central location.

🏗️ 3. Ongoing Renovation Work

The renovation node is connected to ongoing, noisy, noise, informed, morning, wifi, furniture, time, booking, lot, and due. → Noise and renovation notices are consistently mentioned across reviews.

🛠️ 4. Aging Facilities & Need for Updates

Nodes such as refurbishment, outdated, dated, and broken cluster with hotel, view, and staff. → Negative terms like “outdated,” “dated,” and “broken” rank high in TF–IDF scores, meaning they are much more characteristic of this hotel’s reviews than of the other hotels compared, which points to distinctive concerns about aging infrastructure.
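To see why a term can rank high in TF–IDF, it helps to look at a tiny example. The following is a minimal sketch using `bind_tf_idf()` from tidytext, assuming each hotel is treated as one document; the hotel names and counts are made up for illustration and are not taken from the dataset.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy word counts per hotel (illustrative values only)
toy_counts <- tribble(
  ~Hotel_Name, ~word,   ~n,
  "Hotel A",   "staff", 10,
  "Hotel A",   "dated",  4,
  "Hotel B",   "staff", 12,
  "Hotel B",   "canal",  5
)

# tf = share of a hotel's words; idf = log(number of hotels / hotels using the word)
toy_tfidf <- toy_counts %>%
  bind_tf_idf(term = word, document = Hotel_Name, n = n) %>%
  arrange(desc(tf_idf))

print(toy_tfidf)
```

In this toy data, "staff" is the most frequent word but appears for both hotels, so its TF–IDF is zero, while "dated" appears for only one hotel and therefore scores above zero. A high TF–IDF score marks a word as distinctive for a hotel, not simply as frequent.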

🥤 5. Beverage & In-Room Amenities

The water node connects with free, coffee, tea, kettle, bottle, facilities, shower, temperature, hot, cold, and bathroom. → Many reviewers compliment the complimentary water, in-room coffee/tea makers, and reliable hot/cold water.

🛍️ 6. Access to Shopping & Leisure

The shopping and mall nodes are linked with metro, views, night, and bit. → Proximity to the metro, shopping centers, and walking paths makes it convenient for shopping, nighttime strolls, and leisure activities.

✨ Summary

Reviews of Hyatt Regency Paris Etoile highlight:

- Metro access as a top strength
- Spectacular views of the Eiffel Tower and Arc de Triomphe, along with a prime location
- Ongoing renovation noise and lack of advance notice
- Concerns over dated facilities and room conditions
- Complimentary drinks and amenities praised by guests
- Immediate proximity to shopping malls and leisure spots

🏨 Final Recommendation: Hyatt Regency Amsterdam

- Ideally located in the heart of Amsterdam, with easy walking access to major attractions
- Peaceful canal-view rooms and friendly service provide a relaxing atmosphere
- Excellent tram and metro access, making it easy to reach the airport and city outskirts
- Described as a “hidden gem” with a cozy, calming feel, which suits solo travelers seeking calm in the city
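The recommendation above rests on reading the three networks side by side. A keyword-based comparison could make the same reasoning explicit by counting how many of a traveler's preference keywords appear among each hotel's key terms. The following is a minimal sketch, not the scoring used in this analysis: the key terms are a subset of the node names discussed above, and the `preferences` vector is a hypothetical assumption about what the traveler values.

```r
library(dplyr)
library(tibble)

# Key terms per hotel: a subset of the node names discussed above
hotel_terms <- tribble(
  ~Hotel_Name,                          ~word,
  "Hyatt Regency Amsterdam",            "centre",
  "Hyatt Regency Amsterdam",            "canal",
  "Hyatt Regency Amsterdam",            "metro",
  "Hyatt Regency Amsterdam",            "tram",
  "Hyatt Regency London The Churchill", "size",
  "Hyatt Regency London The Churchill", "late",
  "Hyatt Regency London The Churchill", "breakfast",
  "Hyatt Regency Paris Etoile",         "metro",
  "Hyatt Regency Paris Etoile",         "eiffel",
  "Hyatt Regency Paris Etoile",         "renovation"
)

# Hypothetical preference keywords for the traveler (an assumption)
preferences <- c("centre", "canal", "metro", "quiet")

# Score each hotel by the number of key terms that match a preference
scores <- hotel_terms %>%
  group_by(Hotel_Name) %>%
  summarise(matches = sum(word %in% preferences), .groups = "drop") %>%
  arrange(desc(matches))

print(scores)
```

With these assumed keywords, Hyatt Regency Amsterdam matches the most terms. Different preference keywords would naturally change the ranking, so the keyword list should reflect the traveler profile defined earlier.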

Conclusion

I’m glad I was able to analyze the data in such an interesting way. It was rewarding to see that the results turned out to be meaningful. I also learned many new techniques using ChatGPT 😊

If I had more time, I feel confident that I could dive even deeper into the analysis. I’m motivated to explore even more methods going forward.

Analyzing Hotel Reviews: Emotions, Brands, and Recommendations

박예은

2025-06-18
