
Executive summary

In choosing the data analysis questions, the main consideration was whether the results could be applied to hotel practice. The hope was that they would be useful to the hospitality department, which meets hotel guests in person, and to the marketing department in charge of hotel promotion. Therefore, through hotel review analysis, we tried to express the strengths and weaknesses of each hotel as keywords and to identify what distinguishes each hotel’s image from that of other hotels.

First, let’s look at the results of the frequency analysis. In Hotel Arena’s negative reviews, bathroom, construction, and breakfast were mentioned most often (excluding meaningless keywords such as hotel and negative). This shows that many of the hotel’s negative opinions concerned facilities and breakfast. In the positive reviews, staff and nice were mentioned the most, suggesting that staff who provide good service are creating a positive image.

In K K Hotel George’s negative reviews, breakfast, bed, etc. were mentioned. As with Hotel Arena, there were many negative opinions about breakfast and facilities. In the positive reviews, location was mentioned the most, indicating that a good location is a big advantage of the hotel.

The tf-idf analysis identifies keywords that are mentioned frequently at one hotel but not at the other. First, let’s look at the negative reviews of each hotel. In Hotel Arena’s negative reviews, construction is by far the most prominent keyword, so this seems to need improvement. In K K Hotel George’s negative reviews, there are several meal-related keywords such as eggs and cooked, so food quality seems to need improvement. In the actual reviews, many reviewers expressed disappointment with how the eggs were cooked.

In addition, tram and amsterdam ranked at the top of Hotel Arena’s positive reviews. The Amsterdam location, with good access to the tram, appears to be recognized as a differentiating advantage. In K K Hotel George’s positive reviews, tube and london ranked at the top, which suggests that being in London and close to the Tube (the London Underground) is recognized as a differentiating advantage.

Reviewer analysis was also conducted. “Hotel_Reviews.csv” provides not only hotel reviews but also the characteristics of the reviewers who wrote them. This analysis focused on nationality, because cultural background, linguistic differences, etc. are very likely to affect reviews.

When looking at the distribution of reviewer nationality in analysis 4, both hotels showed similar patterns. The big difference was the participation of Irish reviewers. In Hotel Arena’s reviews, Ireland ranked second among reviewer nationalities, but it did not appear in the filtered data for K K Hotel George at all. Ireland is also the country with the highest score in the emotional score calculation (analysis 3). It is therefore possible that the greater participation of Irish reviewers, with their high emotional scores, had a positive effect on Hotel Arena’s overall emotional score.

This suggests that not only the services and facilities provided but also other factors, such as reviewer nationality, may affect hotel reviews. Review analysis results should therefore be interpreted with these factors in mind before they are applied in practice.

Data background

The “Hotel_Reviews.csv” data I chose is European hotel review data, which contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe. The data comes from Booking.com and has a total of 17 fields.

The 17 fields are as follows: 1) Hotel address 2) Review publication date 3) Hotel average score 4) Hotel name 5) Reviewer nationality 6) Negative review 7) Total number of words in the negative review 8) Positive review 9) Total number of words in the positive review 10) Reviewer score 11) Number of reviews the reviewer has given in the past 12) Total number of valid reviews held by the hotel 13) Tags 14) Additional score 15) Days since review 16) Latitude 17) Longitude

This data contains a lot of information about reviewers as well as hotel reviews, so it is useful not only for review analysis but also for understanding the characteristics of reviewers. Because reviews of 1,493 hotels are too many to compare directly, I selected only the data for two hotels for the hotel comparison analysis.

Data loading, cleaning and preprocessing

Only the necessary columns were selected from the original data, and they were then reshaped into a form suitable for frequency analysis, tf-idf analysis, and emotion analysis.

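# Packages used throughout the analysis
library(dplyr)      # data manipulation and %>%
library(tidytext)   # unnest_tokens, stop_words, bind_tf_idf, reorder_within
library(ggplot2)    # plotting
library(gridExtra)  # grid.arrange

# Data of two hotels
data <- read.csv("Hotel_Reviews.csv")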

data_hotels <- data %>% filter(Hotel_Name %in% c("Hotel Arena", "K K Hotel George")) %>% select(Hotel_Name, Reviewer_Nationality, Negative_Review, Positive_Review, Reviewer_Score)

# Tokenize negative reviews and drop standard stop words
data_hotels_ne <- data_hotels %>%
  unnest_tokens(input = Negative_Review,
                output = word) %>%
  anti_join(stop_words, by = "word")

frequency_ne <- data_hotels_ne %>% count(Hotel_Name, word, sort = TRUE)

# Tokenize positive reviews and drop standard stop words
data_hotels_po <- data_hotels %>%
  unnest_tokens(input = Positive_Review,
                output = word) %>%
  anti_join(stop_words, by = "word")

frequency_po <- data_hotels_po %>% count(Hotel_Name, word, sort = TRUE)


# Data of each hotel
data_Arena <- data %>% filter(Hotel_Name == "Hotel Arena") %>% select(Hotel_Name, Reviewer_Nationality, Negative_Review, Positive_Review, Reviewer_Score)

data_KK <- data %>% filter(Hotel_Name == "K K Hotel George") %>% select(Hotel_Name, Reviewer_Nationality, Negative_Review, Positive_Review, Reviewer_Score)


# Negative_Review and Positive_Review tokens for each hotel
data_Arena_ne <- data_Arena %>% select(Hotel_Name, Negative_Review) %>%
  unnest_tokens(input = Negative_Review, output = word) %>%
  anti_join(stop_words, by = "word")

data_Arena_po <- data_Arena %>% select(Hotel_Name, Positive_Review) %>%
  unnest_tokens(input = Positive_Review, output = word) %>%
  anti_join(stop_words, by = "word")

data_KK_ne <- data_KK %>% select(Hotel_Name, Negative_Review) %>%
  unnest_tokens(input = Negative_Review, output = word) %>%
  anti_join(stop_words, by = "word")

data_KK_po <- data_KK %>% select(Hotel_Name, Positive_Review) %>%
  unnest_tokens(input = Positive_Review, output = word) %>%
  anti_join(stop_words, by = "word")

# New table for reviewer analysis
data2 <- data %>%
  select(Reviewer_Nationality, Reviewer_Score, Negative_Review, Positive_Review) %>%
  filter(!is.na(Reviewer_Nationality) & !is.na(Reviewer_Score))

data2 <- data2 %>%
  mutate(Review_Text = paste(Negative_Review, Positive_Review))

# One row per word, with standard stop words removed
tidy_reviews <- data2 %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words, by = "word")

reviewer_data <- data %>%
  select(Hotel_Name, Reviewer_Nationality) %>%
  filter(!is.na(Hotel_Name) & !is.na(Reviewer_Nationality)) %>% 
  filter(Hotel_Name %in% c("Hotel Arena", "K K Hotel George"))

Text data analysis

Individual analysis and figures

Analysis and Figure 1 [Analysis of reviews by each hotel]

Analysis No. 1 is a frequency analysis of the keywords most often mentioned in each hotel’s positive and negative reviews. It was selected as analysis No. 1 because it is the most basic way to grasp each hotel’s strengths and weaknesses as keywords.

Four tables were created by separating each hotel’s positive and negative reviews from the original data and tokenizing them. Each table was then visualized as a graph, and the graphs were grouped by hotel so that the keywords of positive and negative reviews can be seen at a glance.
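The executive summary sets aside uninformative keywords such as hotel and negative, while the code removes only the standard stop_words list. As a minimal sketch, assuming a hand-picked list of extra words (custom_stop_words and the toy reviews below are hypothetical and not part of the original analysis), such words could be dropped before counting:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up reviews standing in for the Negative_Review column.
toy_reviews <- tibble(
  Hotel_Name = c("Hotel Arena", "Hotel Arena", "K K Hotel George"),
  Negative_Review = c("No Negative",
                      "The hotel bathroom was tiny",
                      "Breakfast at the hotel was cold")
)

# Hypothetical extra stop words appended to tidytext's stop_words table,
# which has the columns `word` and `lexicon`.
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("hotel", "negative", "positive"), lexicon = "custom")
)

# Tokenize, remove standard and custom stop words, then count per hotel.
toy_counts <- toy_reviews %>%
  unnest_tokens(input = Negative_Review, output = word) %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(Hotel_Name, word, sort = TRUE)

print(toy_counts)
```

The same custom_stop_words table could replace stop_words in each anti_join() call, so the top 10 lists would no longer need to be filtered by eye.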

# count 

frequency_Arena_ne <- data_Arena_ne %>% count(Hotel_Name, word, sort=T)
frequency_Arena_po <- data_Arena_po %>% count(Hotel_Name, word, sort=T)

frequency_KK_ne <- data_KK_ne %>% count(Hotel_Name, word, sort=T)
frequency_KK_po <- data_KK_po %>% count(Hotel_Name, word, sort=T)


# Graphs of Hotel Arena (negative review)

plot1 <- frequency_Arena_ne %>% 
  slice_max(n, n = 10) %>% 
  arrange(n) %>% 
  ungroup %>% 
  mutate(word = reorder_within(word, n, Hotel_Name)) %>% 
  ggplot(aes(x = word, y = n, fill = Hotel_Name)) +
  geom_col(fill = "red", show.legend = FALSE) +
  geom_text(aes(x = word, y = n, label = n), vjust = 0.5) + 
  scale_x_reordered() + 
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency",
       title = "Top 10 common words in negative review of Hotel Arena")

# Graphs of Hotel Arena (positive review)

plot2 <- frequency_Arena_po %>% 
  slice_max(n, n = 10) %>% 
  arrange(n) %>% 
  ungroup %>% 
  mutate(word = reorder_within(word, n, Hotel_Name)) %>% 
  ggplot(aes(x = word, y = n, fill = Hotel_Name)) +
  geom_col(fill = "skyblue", show.legend = FALSE) +
  geom_text(aes(x = word, y = n, label = n), vjust = 0.5) + 
  scale_x_reordered() + 
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency",
       title = "Top 10 common words in positive review of Hotel Arena")

grid.arrange(plot1, plot2, ncol = 2)

# Graphs of K K Hotel George (negative review)

plot1_1 <- frequency_KK_ne %>% 
  slice_max(n, n = 10) %>% 
  arrange(n) %>% 
  ungroup %>% 
  mutate(word = reorder_within(word, n, Hotel_Name)) %>% 
  ggplot(aes(x = word, y = n, fill = Hotel_Name)) +
  geom_col(fill = "red", show.legend = FALSE) +
  geom_text(aes(x = word, y = n, label = n), vjust = 0.5) + 
  scale_x_reordered() + 
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency",
       title = "Top 10 common words in negative review of K K Hotel George")

# Graphs of K K Hotel George (positive review)

plot1_2 <- frequency_KK_po %>% 
  slice_max(n, n = 10) %>% 
  arrange(n) %>% 
  ungroup %>% 
  mutate(word = reorder_within(word, n, Hotel_Name)) %>% 
  ggplot(aes(x = word, y = n, fill = Hotel_Name)) +
  geom_col(fill = "skyblue", show.legend = FALSE) +
  geom_text(aes(x = word, y = n, label = n), vjust = 0.5) + 
  scale_x_reordered() + 
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency",
       title = "Top 10 common words in positive review of K K Hotel George")

grid.arrange(plot1_1, plot1_2, ncol = 2)

Analysis and Figure 2 [Analysis of reviews by each hotel]

Analysis No. 2 is a tf-idf analysis, which identifies the keywords that are characteristic of each hotel. Because tf-idf gives more weight to words that are frequent at one hotel but absent at the other, it reveals the differences between the hotels and the unique characteristics and services of each, which is why it was selected as analysis No. 2.

Four tables were created containing the tf_idf values of each hotel’s positive and negative reviews, and each table was visualized as a graph. The positive-review graphs of the two hotels were then grouped together, as were the negative-review graphs, so that each hotel’s distinguishing characteristics could be identified.
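To make the weighting concrete, here is a small sketch with made-up counts (toy_counts is hypothetical). bind_tf_idf() multiplies a word's share of its document (tf) by idf = ln(number of documents / number of documents containing the word). With only two hotels as documents, idf is either ln(2) for a word found at one hotel or 0 for a word found at both, so shared words such as breakfast drop out of the ranking.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up word counts; each hotel is one "document".
toy_counts <- tribble(
  ~Hotel_Name,        ~word,          ~n,
  "Hotel Arena",      "breakfast",    6,
  "Hotel Arena",      "construction", 4,
  "K K Hotel George", "breakfast",    5,
  "K K Hotel George", "eggs",         5
)

# tf = n / total words in the hotel; idf = ln(2 / hotels containing the word).
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(term = word, document = Hotel_Name, n = n) %>%
  arrange(desc(tf_idf))

print(toy_tf_idf)
```

In this toy table, breakfast has the largest raw count at Hotel Arena yet receives a tf_idf of 0, while construction and eggs, each unique to one hotel, rise to the top. This is why the tf-idf charts can look quite different from the frequency charts.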

# tf_idf

tf_idf_ne <- frequency_ne %>%
  bind_tf_idf(term = word,           
              document = Hotel_Name,  
              n = n) %>%             
  arrange(-tf_idf)

tf_idf_po <- frequency_po %>%
  bind_tf_idf(term = word,           
              document = Hotel_Name,  
              n = n) %>%             
  arrange(-tf_idf)


# Top 10 tf-idf words for each hotel
tf_idf_Arena_ne <- tf_idf_ne %>% filter(Hotel_Name == "Hotel Arena") %>% slice_max(tf_idf, n = 10)
tf_idf_Arena_po <- tf_idf_po %>% filter(Hotel_Name == "Hotel Arena") %>% slice_max(tf_idf, n = 10)

tf_idf_KK_ne <- tf_idf_ne %>% filter(Hotel_Name == "K K Hotel George") %>% slice_max(tf_idf, n = 10)
tf_idf_KK_po <- tf_idf_po %>% filter(Hotel_Name == "K K Hotel George") %>% slice_max(tf_idf, n = 10)


# Graphs of Hotel Arena

plot2_1 <- ggplot(tf_idf_Arena_ne, aes(x = reorder_within(word, tf_idf, Hotel_Name),
                  y = tf_idf,
                  fill = Hotel_Name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL,
      title = "The top 10 words in tf-idf_NEGATIVE_Hotel Arena") 


plot2_2 <-ggplot(tf_idf_Arena_po, aes(x = reorder_within(word, tf_idf, Hotel_Name),
                  y = tf_idf,
                  fill = Hotel_Name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL,
      title = "The top 10 words in tf-idf_POSITIVE_Hotel Arena") 


# Graphs of K K Hotel George

plot2_1_1 <- ggplot(tf_idf_KK_ne, aes(x = reorder_within(word, tf_idf, Hotel_Name),
                  y = tf_idf,
                  fill = Hotel_Name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL,
      title = "The top 10 words in tf-idf_NEGATIVE_K K Hotel George") 


plot2_2_2 <-ggplot(tf_idf_KK_po, aes(x = reorder_within(word, tf_idf, Hotel_Name),
                  y = tf_idf,
                  fill = Hotel_Name)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ Hotel_Name, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL,
      title = "The top 10 words in tf-idf_POSITIVE_K K Hotel George") 


grid.arrange(plot2_1, plot2_1_1, ncol = 2)

grid.arrange(plot2_2, plot2_2_2, ncol = 2)

Analysis and Figure 3 [Analysis of Reviewers by Nationality]

Analysis No. 3 examines the average emotional score of reviewers by nationality. Since nationality was judged to be a factor that can strongly influence a review, reviewers were grouped by nationality and scored with the “afinn” emotion dictionary, which assigns each word a numeric score.

To increase the reliability of the emotional scores, only nationalities with a count of at least 500 were kept. Note that this count (Review_Count in the code) is the number of words matched in the dictionary, not the number of whole reviews. The graph is ordered by score, smallest first, so that you can clearly see how emotional scores differ from country to country.
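The scoring step can be illustrated on a toy example. The sketch below uses a hypothetical four-word lexicon with illustrative scores in place of AFINN, and a hypothetical review_id column that the original data does not have, to show how the per-nationality mean is formed and how the number of matched words differs from the number of reviews.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical mini lexicon in the same shape as get_sentiments("afinn"):
# one row per word with a numeric `value`. Scores are illustrative only.
toy_lexicon <- tribble(
  ~word,      ~value,
  "friendly",  2,
  "lovely",    3,
  "dirty",    -2,
  "noisy",    -1
)

# Made-up reviews; review_id is an assumed identifier added for counting.
toy_reviews <- tibble(
  review_id = 1:3,
  Reviewer_Nationality = c("Ireland", "Ireland", "France"),
  Review_Text = c("Friendly staff and lovely room",
                  "Lovely location but noisy street",
                  "Dirty bathroom")
)

# Keep only words found in the lexicon, then average their scores.
toy_scores <- toy_reviews %>%
  unnest_tokens(word, Review_Text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(Reviewer_Nationality) %>%
  summarize(Average_Sentiment_Score = mean(value),
            Matched_Words = n(),
            Reviews = n_distinct(review_id),
            .groups = "drop")

print(toy_scores)
```

Here the two Irish reviews contribute four matched words, so a threshold applied to n() after the join filters on matched words. A threshold on reviews would need a per-review identifier like the assumed review_id.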

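# load AFINN (get_sentiments("afinn") needs the textdata package installed)
afinn <- get_sentiments("afinn")


# sentiment score: mean AFINN value of the matched words per nationality
# (Review_Count is the number of matched words, not the number of reviews)
sentiment_scores <- tidy_reviews %>%
  inner_join(afinn, by = "word") %>%
  group_by(Reviewer_Nationality) %>%
  summarize(Average_Sentiment_Score = mean(value, na.rm = TRUE),
            Review_Count = n())

# keep only nationalities with at least 500 matched words
sentiment_scores_filtered <- sentiment_scores %>%
  filter(Review_Count >= 500)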


# Graphs of Average emotional score by nationality
ggplot(sentiment_scores_filtered, aes(x = reorder(Reviewer_Nationality, -Average_Sentiment_Score), y = Average_Sentiment_Score)) +
  geom_col(fill = "skyblue", show.legend = FALSE) + 
  coord_flip() +
  labs(title = "Average emotional score by nationality",
       x = "Nationality",
       y = "Average emotional score") +
  theme_minimal()

Analysis and Figure 4 [Analysis of Reviewers by Nationality]

After examining the average emotional score of reviewers by nationality, analysis No. 4 looks at how reviewer nationalities are distributed in the reviews of the two hotels discussed earlier. Since the number of countries is large, only the top 15 nationalities by total review count were kept.

In order to see the distribution of reviewer nationalities of the two hotels at a glance, the panels for the two hotels were placed together in one figure.

# Calculation of Nationality Distribution by Hotel
hotel_nationality <- reviewer_data %>%
  group_by(Hotel_Name, Reviewer_Nationality) %>%
  summarize(Review_Count = n(), .groups = "drop") %>%
  arrange(desc(Review_Count))

# filtering: keep the 15 nationalities with the most reviews across both hotels
top_nationalities <- hotel_nationality %>%
  group_by(Reviewer_Nationality) %>%
  summarize(Total_Review_Count = sum(Review_Count)) %>%
  slice_max(Total_Review_Count, n = 15) %>%
  pull(Reviewer_Nationality)

hotel_nationality_filtered <- hotel_nationality %>%
  filter(Reviewer_Nationality %in% top_nationalities)

# Graphs of Distribution of reviewers
ggplot(hotel_nationality_filtered, aes(x = reorder_within(Reviewer_Nationality, Review_Count, Hotel_Name), y = Review_Count, fill = Hotel_Name)) +
  geom_col(show.legend = TRUE) +
  facet_wrap(~ Hotel_Name, scales = "free_y", ncol = 1) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Distribution of reviewers by nationality of Hotel Arena and K K Hotel George",
       x = "Nationality",
       y = "Number of reviews") +
  theme_minimal()
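Because the two hotels have different numbers of reviews, raw counts are not directly comparable between them. One possible refinement, sketched here on made-up counts (toy_nationality and the Share column are assumptions and not part of the original analysis), is to convert the counts into within-hotel shares:

```r
library(dplyr)
library(tibble)

# Made-up review counts in the same shape as hotel_nationality.
toy_nationality <- tribble(
  ~Hotel_Name,        ~Reviewer_Nationality,      ~Review_Count,
  "Hotel Arena",      "United Kingdom",           60,
  "Hotel Arena",      "Ireland",                  20,
  "Hotel Arena",      "Netherlands",              20,
  "K K Hotel George", "United Kingdom",           150,
  "K K Hotel George", "United States of America", 50
)

# Share of each nationality among that hotel's reviewers.
toy_share <- toy_nationality %>%
  group_by(Hotel_Name) %>%
  mutate(Share = Review_Count / sum(Review_Count)) %>%
  ungroup()

print(toy_share)
```

Plotting Share instead of Review_Count would put both hotels on the same 0 to 1 scale, which may make a comparison such as the Irish reviewers' presence easier to read.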

In showing the figures that you created, describe why you designed it the way you did. Why did you choose those colors, fonts, and other design elements? Does it convey truth?

Analysis and Figure 1: Since this analysis derives the top 10 words for positive and negative reviews, a bar graph was chosen as the most intuitive form, and the counts were displayed on the bars. Positive reviews are shown in sky blue and negative reviews in red so that the polarity can be understood at a glance.
Analysis and Figure 2: The tf-idf analysis focuses on what distinguishes each hotel from the other. Therefore, the positive-review graphs of the two hotels were placed side by side, as were the negative-review graphs. I think this makes the differences between the hotels clearer.
Analysis and Figure 3: Since the data covers many countries, I tried to construct a graph that highlights the differences between countries rather than the exact values. Therefore, the countries were ordered by emotional score, smallest first. The bars use sky blue rather than an achromatic color so that the scores stand out.
Analysis and Figure 4: The reviewer nationality distributions of the two hotels were placed in one figure so that the differences between them could be seen at a glance. Each hotel has its own color to distinguish it from the other.

Analysis of Two hotel reviews and Reviewers

Jae eun Suh

2024/06/16