Executive Summary

In this project, I analyzed reviews of European hotels to understand customer sentiment and to rank hotels by how often positive and negative words appear in their reviews. I also examined how review sentiment relates to the number of reviews per hotel. The main question of my final project is, “Do the number of reviews and the frequency of positive/negative words affect the average rating of a hotel?”

Data Background

The dataset used in this analysis is “Hotel_Reviews.csv,” which contains reviews of European hotels collected from Booking.com. It has 515,738 rows and 17 columns, including the hotel name, review scores, and the text of the positive and negative reviews.

Data Loading, Cleaning, and Preprocessing

First, I loaded the dataset using the read_csv() function and stored it in the data variable. Then, I performed necessary data cleaning steps to remove irrelevant variables and handle missing data.
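The cleaning steps are not shown in the code, so the following is only a minimal sketch of what "remove irrelevant variables and handle missing data" could look like. It runs on a small made-up tibble; the choice of columns to keep, the extra `lat` column, and the rule for missing text are assumptions made for illustration, not the steps actually used in this project.

```r
library(dplyr)

# Toy stand-in for the real file: column names follow the dataset,
# the values are invented for illustration.
reviews <- tibble(
  Hotel_Name      = c("Hotel A", "Hotel A", "Hotel B", NA),
  Average_Score   = c(8.1, 8.1, 7.4, 9.0),
  Positive_Review = c("Great staff", "Lovely room", NA, "Nice view"),
  Negative_Review = c("Small room", "Noisy street", "Cold breakfast", "Nothing"),
  lat             = c(51.5, 51.5, 48.9, 52.4)  # assumed to be irrelevant for text analysis
)

cleaned <- reviews %>%
  # keep only the columns used later (assumed selection)
  select(Hotel_Name, Average_Score, Positive_Review, Negative_Review) %>%
  # drop rows without a hotel name, since they cannot be grouped by hotel
  filter(!is.na(Hotel_Name)) %>%
  # treat missing review text as an empty string so tokenising still works
  mutate(across(c(Positive_Review, Negative_Review), ~ coalesce(.x, "")))

cleaned
```

Replacing missing text with an empty string rather than dropping the row keeps the review in the per-hotel counts while contributing no words.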

Text Data Analysis

To analyze the text data in the reviews, I utilized several libraries, including tidytext, dplyr, wordcloud, and ggplot2. I started by loading stop words to remove common words that do not carry significant meaning.

Next, I combined the positive and negative reviews into a single text column and removed stop words from the combined reviews. Then, I calculated the count of each term in the cleaned reviews.

library(readr)  # provides read_csv()
data <- read_csv("Hotel_Reviews.csv")
## Rows: 515738 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Hotel_Address, Review_Date, Hotel_Name, Reviewer_Nationality, Negat...
## dbl (9): Additional_Number_of_Scoring, Average_Score, Review_Total_Negative_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Individual Analysis and Figures

Analysis and Figure 1: Word Cloud

To visualize the most frequent terms in the reviews, I created a word cloud using the wordcloud() function. The larger a word appears, the more often it occurs. The cloud is limited to the 50 most frequent terms and shows at a glance which aspects of a hotel stay customers mention most.

# Analysis and Figure 1: Word Cloud
library(tidytext)
library(dplyr)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)


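# Load tidytext's built-in stop word table
data(stop_words)

# Combine the positive and negative text of each review into one string
all_reviews <- paste(data$Positive_Review, data$Negative_Review, sep = " ")

# Split the text into one lowercase word per row and drop stop words
clean_reviews <- all_reviews %>%
  tibble(text = .) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Count each remaining term, most frequent first
term_counts <- clean_reviews %>%
  count(word) %>%
  arrange(desc(n))

# Draw the 50 most frequent terms
wordcloud(term_counts$word, term_counts$n, scale = c(6, 0.5), max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))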
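To make the tokenising and stop word steps concrete, here is a small sketch of the same pipeline on two made-up sentences. The sentences and the four-word stop list `toy_stop_words` are hypothetical; the analysis itself uses tidytext's full `stop_words` table.

```r
library(dplyr)
library(tidytext)

# Two invented review texts
toy_reviews <- tibble(text = c("The staff were friendly and the room was clean",
                               "The breakfast was cold"))

# Hypothetical mini stop list, standing in for tidytext's `stop_words`
toy_stop_words <- tibble(word = c("the", "were", "and", "was"))

toy_counts <- toy_reviews %>%
  unnest_tokens(word, text) %>%               # one lowercase word per row
  anti_join(toy_stop_words, by = "word") %>%  # drop rows whose word is in the stop list
  count(word, sort = TRUE)                    # frequency of each remaining word

toy_counts
```

Only content words such as "staff" and "breakfast" remain after the anti-join, which is why the word cloud is dominated by nouns and adjectives rather than articles.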

Analysis and Figure 2: Top 20 Terms Bar Chart

To look more closely at the most frequent terms, I selected the 20 terms with the highest counts, reordered the levels of the word factor by count, and plotted the counts as a bar chart. The chart indicates that customers write about the staff a great deal. Location and breakfast, as well as facilities such as beds and bathrooms, also come up often, which suggests that these matter to guests.

# Analysis and Figure 2: Top 20 Terms Bar Chart

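# Keep the 20 most frequent terms (term_counts is already sorted by n)
top_terms <- term_counts %>%
  slice_max(n, n = 20)

# Fix the factor order so the bars follow frequency rather than the alphabet
top_terms$word <- factor(top_terms$word, levels = rev(top_terms$word))

# One bar per term, with slanted labels so the words stay readable
ggplot(top_terms, aes(x = word, y = n, fill = word)) +
  geom_col() +
  labs(x = "Term", y = "Count", title = "Top 20 Terms") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))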

Analysis and Figure 3: Count of Positive and Negative Words

To analyze the sentiment in the reviews, I processed the Positive_Review and Negative_Review fields separately. For each field, I counted the words that appear in the AFINN lexicon. Note that the join keeps every AFINN-scored word regardless of the sign of its score, so “positive” and “negative” here refer to the field a word was written in, not to the polarity of the word itself. I then created a bar plot to compare the two counts. Sentiment-bearing words are more frequent in the positive field than in the negative field, which points to an overall positive evaluation from customers.

# Analysis and Figure 3: Count of Positive and Negative Words

# Make sure both review columns are character vectors
data <- mutate(data, Positive_Review = as.character(Positive_Review),
                    Negative_Review = as.character(Negative_Review))

# Count the words in Positive_Review that appear in the AFINN lexicon.
# The join keeps every scored word, whatever the sign of its score.
positive_word_freq <- data %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Sentiment = "Positive") %>%
  rename(Count = n)

# Same count for the Negative_Review field
negative_word_freq <- data %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Sentiment = "Negative") %>%
  rename(Count = n)

# Stack the two one-row results for plotting
word_freq <- bind_rows(positive_word_freq, negative_word_freq)

# Compare the two totals side by side
ggplot(word_freq, aes(x = Sentiment, y = Count, fill = Sentiment)) +
  geom_bar(stat = "identity") +
  labs(x = "Sentiment", y = "Count", fill = "Sentiment") +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red")) +
  ggtitle("Count of Positive and Negative Words")
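The code above labels words by the review field they came from. An alternative, shown here only as a sketch, is to label each word by the sign of its lexicon score. The lexicon `toy_lexicon` and the reviews are made up; the lexicon uses the same two columns (`word`, `value`) as `get_sentiments("afinn")`, but its scores are invented.

```r
library(dplyr)
library(tidytext)

# Toy lexicon with the same columns as the AFINN table; scores are made up
toy_lexicon <- tibble(word  = c("great", "friendly", "dirty", "noisy"),
                      value = c(3, 2, -2, -1))

# Invented reviews; note that a negative word can appear in the positive field
toy_reviews <- tibble(
  Positive_Review = c("great location friendly staff", "friendly but noisy"),
  Negative_Review = c("dirty bathroom", "noisy street dirty carpet")
)

sentiment_counts <- toy_reviews %>%
  mutate(text = paste(Positive_Review, Negative_Review)) %>%  # pool both fields
  select(text) %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%                    # keep scored words only
  mutate(Sentiment = if_else(value > 0, "Positive", "Negative")) %>%
  count(Sentiment, name = "Count")

sentiment_counts
```

With this approach the word "noisy" in a positive review is counted as negative, which the field-based count would not do.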

Analysis and Figure 4: Number of Reviews per Hotel (Top 20)

To examine the popularity of hotels, I counted the reviews per hotel, selected the 20 hotels with the most reviews, and created a bar plot of their counts. These most-reviewed hotels serve as the reference group in the later figures, which check whether the number of reviews is related to a hotel’s rating.

# Analysis and Figure 4: Number of Reviews per Hotel (Top 20)

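# Re-read the file as a base data.frame for the per-hotel analyses
df <- read.csv("Hotel_Reviews.csv", header = TRUE, stringsAsFactors = FALSE)

# Count reviews per hotel
hotel_counts <- table(df$Hotel_Name)

# Names of the 20 hotels with the most reviews
top_hotels <- names(hotel_counts)[order(hotel_counts, decreasing = TRUE)][1:20]

# Keep only the reviews of those hotels
df_top_20 <- df[df$Hotel_Name %in% top_hotels, ]

# Order the factor so the most-reviewed hotel is drawn at the top after coord_flip()
df_top_20$Hotel_Name <- factor(df_top_20$Hotel_Name, levels = rev(top_hotels))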


ggplot(df_top_20, aes(x = Hotel_Name)) +
  geom_bar(fill = "blue") +
  labs(title = "Number of Reviews per Hotel (Top 20)",
       x = "Hotel Name",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) +
  coord_flip()
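The same per-hotel count can also be written with dplyr, which the rest of the report already uses. This is a sketch on made-up hotel names, not a replacement for the code above; `Review_Count` is a column name chosen here for illustration.

```r
library(dplyr)

# Invented review-level data: one row per review
toy_df <- tibble(Hotel_Name = c("Hotel A", "Hotel B", "Hotel A",
                                "Hotel C", "Hotel A", "Hotel B"))

top_counts <- toy_df %>%
  count(Hotel_Name, sort = TRUE, name = "Review_Count") %>%  # reviews per hotel, largest first
  slice_head(n = 2)                                          # keep the top 2 (top 20 in the report)

top_counts
```

Because the result is already one row per hotel, it can be passed to `geom_col()` directly instead of letting `geom_bar()` count rows.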

Analysis and Figure 5: Top 20 Hotels by Average Score

To identify the top hotels by rating, I took the average score of each hotel and listed the 20 highest. Compared with the most-reviewed hotels in Figure 4, the number of reviews and the average rating do not move together: receiving many reviews does not imply a high average score.

# Analysis and Figure 5: Top 20 Hotels by Average Score

top_20_hotels <- df %>%
  group_by(Hotel_Name) %>%
  summarise(Average_Score = mean(Average_Score, na.rm = TRUE)) %>%
  arrange(desc(Average_Score)) %>%
  slice(1:20)

top_20_hotels
## # A tibble: 20 × 2
##    Hotel_Name                                   Average_Score
##    <chr>                                                <dbl>
##  1 Ritz Paris                                             9.8
##  2 41                                                     9.6
##  3 H tel de La Tamise Esprit de France                    9.6
##  4 H10 Casa Mimosa 4 Sup                                  9.6
##  5 Haymarket Hotel                                        9.6
##  6 Hotel Casa Camper                                      9.6
##  7 Hotel The Serras                                       9.6
##  8 Charlotte Street Hotel                                 9.5
##  9 Ham Yard Hotel                                         9.5
## 10 Hotel Sacher Wien                                      9.5
## 11 Hotel The Peninsula Paris                              9.5
## 12 Le Narcisse Blanc Spa                                  9.5
## 13 Mercer Hotel Barcelona                                 9.5
## 14 Milestone Hotel Kensington                             9.5
## 15 Palais Coburg Residenz                                 9.5
## 16 Taj 51 Buckingham Gate Suites and Residences           9.5
## 17 The Soho Hotel                                         9.5
## 18 Waldorf Astoria Amsterdam                              9.5
## 19 45 Park Lane Dorchester Collection                     9.4
## 20 Batty Langley s                                        9.4

Analysis and Figure 6: Lowest 20 Hotels by Average Score

Similarly, I identified the lowest-rated hotels by taking the average score of each hotel and listing the 20 lowest. Compared with the most-reviewed hotels in Figure 4, the average rating again does not appear to depend on the number of reviews.

# Analysis and Figure 6: Lowest 20 Hotels by Average Score

lowest_20_hotels <- df %>%
  group_by(Hotel_Name) %>%
  summarise(Average_Score = mean(Average_Score, na.rm = TRUE)) %>%
  arrange(Average_Score) %>%
  slice(1:20)

lowest_20_hotels
## # A tibble: 20 × 2
##    Hotel_Name                                   Average_Score
##    <chr>                                                <dbl>
##  1 Hotel Liberty                                          5.2
##  2 Hotel Cavendish                                        6.4
##  3 Savoy Hotel Amsterdam                                  6.4
##  4 Best Western Maitrise Hotel Edgware Road               6.6
##  5 The Tophams Hotel                                      6.6
##  6 Commodore Hotel                                        6.7
##  7 Ibis Styles Milano Palmanova                           6.7
##  8 Bloomsbury Palace Hotel                                6.8
##  9 Villa Eugenie                                          6.8
## 10 Gainsborough Hotel                                     6.9
## 11 Hallmark Hotel London Chigwell Prince Regent           6.9
## 12 Idea Hotel Milano San Siro                             6.9
## 13 Eurohotel Diagonal Port                                7  
## 14 Gran Hotel Barcino                                     7  
## 15 Henry VIII                                             7  
## 16 Hotel Royal Elys es                                    7  
## 17 IH Hotels Milano Lorenteggio                           7  
## 18 London Elizabeth Hotel                                 7  
## 19 NH Carlton Amsterdam                                   7  
## 20 Park Lane Mews Hotel                                   7
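The comparison between review count and average score is made by eye in this report. One way it could be quantified is a rank correlation across hotels, sketched below on made-up data. The hotels and scores are invented, and the choice of Spearman's method is an assumption; no such test was run in this project.

```r
library(dplyr)

# Invented review-level data: Average_Score is repeated on every review of a hotel
toy_df <- tibble(
  Hotel_Name    = rep(c("Hotel A", "Hotel B", "Hotel C", "Hotel D"), times = c(5, 3, 2, 1)),
  Average_Score = rep(c(7.1, 8.4, 9.0, 8.0), times = c(5, 3, 2, 1))
)

# Collapse to one row per hotel: number of reviews and the hotel's score
hotel_summary <- toy_df %>%
  group_by(Hotel_Name) %>%
  summarise(Review_Count  = n(),
            Average_Score = first(Average_Score),
            .groups = "drop")

# Rank correlation between review count and score across hotels
rho <- cor(hotel_summary$Review_Count, hotel_summary$Average_Score,
           method = "spearman")
rho
```

A value near zero on the real data would support the reading that review volume and rating are unrelated, while a clearly positive or negative value would call for a closer look.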

Analysis and Figure 7: Positive Frequency - Negative Frequency

To compare sentiment across hotels, I restricted the data to the 20 most-reviewed hotels and, for each one, counted the AFINN-matched words in the Positive_Review field (positive frequency) and in the Negative_Review field (negative frequency). I then subtracted the negative frequency from the positive frequency and created a bar plot of the differences. Most of these hotels have more words in the positive field, but not all. For ‘Britannia International Hotel Canary Wharf’ the negative frequency (9,048) exceeds the positive frequency (7,475), and for Hilton London Metropole the two are almost equal (4,368 vs. 4,412). Overall, hotels with a high number of reviews tended to have more positive evaluations, but this cannot be generalized.

# Analysis and Figure 7: Positive Frequency - Negative Frequency

hotel_counts <- table(df$Hotel_Name)


top_hotels <- names(hotel_counts)[order(hotel_counts, decreasing = TRUE)][1:20]


df_top_20 <- df[df$Hotel_Name %in% top_hotels, ]


positive_word_freq <- df_top_20 %>%
  mutate(Positive_Review = as.character(Positive_Review)) %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(Hotel_Name, Review_Date) %>%
  summarise(Positive_Frequency = n())
## `summarise()` has grouped output by 'Hotel_Name'. You can override using the
## `.groups` argument.
negative_word_freq <- df_top_20 %>%
  mutate(Negative_Review = as.character(Negative_Review)) %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(Hotel_Name, Review_Date) %>%
  summarise(Negative_Frequency = n())
## `summarise()` has grouped output by 'Hotel_Name'. You can override using the
## `.groups` argument.
word_freq <- left_join(positive_word_freq, negative_word_freq, by = c("Hotel_Name", "Review_Date"))


summary <- word_freq %>%
  group_by(Hotel_Name) %>%
  summarise(
    Positive_Frequency = sum(Positive_Frequency, na.rm = TRUE),
    Negative_Frequency = sum(Negative_Frequency, na.rm = TRUE),
    Total_Frequency = Positive_Frequency + Negative_Frequency
  ) %>%
  arrange(desc(Total_Frequency))


summary
## # A tibble: 20 × 4
##    Hotel_Name              Positive_Frequency Negative_Frequency Total_Frequency
##    <chr>                                <int>              <int>           <int>
##  1 Britannia Internationa…               7475               9048           16523
##  2 Park Plaza Westminster…               8782               6611           15393
##  3 Strand Palace Hotel                   7749               5914           13663
##  4 Copthorne Tara Hotel L…               6756               5371           12127
##  5 DoubleTree by Hilton H…               7014               4595           11609
##  6 Grand Royale London Hy…               6050               4140           10190
##  7 Holiday Inn London Ken…               5134               4705            9839
##  8 Intercontinental Londo…               6253               3322            9575
##  9 Hilton London Metropole               4368               4412            8780
## 10 Millennium Gloucester …               4377               4325            8702
## 11 DoubleTree by Hilton L…               4914               3701            8615
## 12 Park Grand Paddington …               5147               3152            8299
## 13 Park Plaza County Hall…               4717               3474            8191
## 14 M by Montcalm Shoredit…               4961               3214            8175
## 15 Blakemore Hyde Park                   4883               3171            8054
## 16 Hotel Da Vinci                        4459               3536            7995
## 17 Park Plaza London Rive…               4471               3504            7975
## 18 Park Grand London Kens…               4717               3171            7888
## 19 Hilton London Wembley                 4413               3347            7760
## 20 St James Court A Taj H…               4617               2922            7539
summary <- summary[order(summary$Positive_Frequency - summary$Negative_Frequency, decreasing = TRUE), ]


ggplot(summary, aes(x = reorder(Hotel_Name, Positive_Frequency - Negative_Frequency), y = Positive_Frequency - Negative_Frequency)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Positive Frequency - Negative Frequency",
       x = "Hotel Name",
       y = "Positive Frequency - Negative Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) +
  coord_flip()
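One caveat about the join above: `left_join()` keeps only the hotel and date pairs that occur in the positive table, so any date on which a hotel has negative words but no positive words is dropped, and the negative totals may be undercounted. The sketch below illustrates the difference on made-up counts; the tables `pos` and `neg` are hypothetical, and the effect on the real totals was not checked.

```r
library(dplyr)

# Invented per-date counts for one hotel
pos <- tibble(Hotel_Name = c("Hotel A", "Hotel A"),
              Review_Date = c("8/1/2017", "8/2/2017"),
              Positive_Frequency = c(4L, 2L))
neg <- tibble(Hotel_Name = c("Hotel A", "Hotel A"),
              Review_Date = c("8/2/2017", "8/3/2017"),
              Negative_Frequency = c(1L, 5L))

# left_join drops 8/3/2017, which exists only in the negative table
left <- left_join(pos, neg, by = c("Hotel_Name", "Review_Date"))
# full_join keeps every date from either table
full <- full_join(pos, neg, by = c("Hotel_Name", "Review_Date"))

left_total <- sum(left$Negative_Frequency, na.rm = TRUE)  # 1
full_total <- sum(full$Negative_Frequency, na.rm = TRUE)  # 6
c(left = left_total, full = full_total)
```

Grouping by `Hotel_Name` alone before joining would avoid the issue as well, since the per-date split is summed away in the next step anyway.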

Analysis and Figure 8: Top 20 Hotels with the Highest Frequency of Positive Words

I also looked at the positive word frequency of every hotel, not only the most-reviewed ones. I counted the AFINN-matched words in each hotel’s Positive_Review text, selected the 20 hotels with the highest counts, and created a bar plot. Compared with the top-rated hotels in Figure 5, having a high frequency of positive words does not necessarily mean a higher rating. One likely reason is that raw word counts grow with the number of reviews a hotel receives.

# Analysis and Figure 8: Top 20 Hotels with the Highest Frequency of Positive Words


df_clean <- df %>%
  mutate(Positive_Review = as.character(Positive_Review)) %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Hotel_Name, sort = TRUE)  # Count AFINN-matched words in Positive_Review per hotel


top_hotels <- df_clean %>%
  top_n(20)
## Selecting by n
library(stringr)


ggplot(top_hotels, aes(x = reorder(Hotel_Name, n), y = n)) +
  geom_bar(fill = "blue", stat = "identity") +
  labs(title = str_wrap("Top 20 Hotels with the Highest Frequency of Positive Words", width = 40),
       x = "Hotel Name",
       y = "Positive Word Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 12, vjust = 1),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm")) +
  coord_flip()
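Because raw word counts grow with the number of reviews, a per-review rate may be a fairer basis for comparing hotels. The following sketch uses made-up totals; the column names `Positive_Words`, `Review_Count`, and `Positive_Words_Per_Review` are hypothetical, and this normalisation was not part of the analysis above.

```r
library(dplyr)

# Invented per-hotel totals
toy <- tibble(Hotel_Name     = c("Hotel A", "Hotel B", "Hotel C"),
              Positive_Words = c(9000, 3000, 1200),
              Review_Count   = c(4500, 1000, 300))

rates <- toy %>%
  # positive words per review removes the effect of review volume
  mutate(Positive_Words_Per_Review = Positive_Words / Review_Count) %>%
  arrange(desc(Positive_Words_Per_Review))

rates
```

Here the hotel with the most positive words overall ends up last once the counts are divided by the number of reviews.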

Analysis and Figure 9: Top 20 Hotels with the Highest Frequency of Negative Words

Similarly, I counted the AFINN-matched words in each hotel’s Negative_Review text, selected the 20 hotels with the highest counts, and created a bar plot. Compared with the lowest-rated hotels in Figure 6, having a high frequency of negative words does not necessarily mean a lower average rating.

# Analysis and Figure 9: Top 20 Hotels with the Highest Frequency of Negative Words

df_clean_negative <- df %>%
  mutate(Negative_Review = as.character(Negative_Review)) %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Hotel_Name, sort = TRUE)  # Count AFINN-matched words in Negative_Review per hotel


top_hotels_negative <- df_clean_negative %>%
  top_n(20)
## Selecting by n
ggplot(top_hotels_negative, aes(x = reorder(Hotel_Name, n), y = n)) +
  geom_bar(fill = "red", stat = "identity") +
  labs(title = str_wrap("Top 20 Hotels with the Highest Frequency of Negative Words", width = 40),
       x = "Hotel Name",
       y = "Negative Word Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 12, vjust = 1),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm")) +
  coord_flip()

Conclusion

In conclusion, this project offers a descriptive look at hotel reviews in Europe. I analyzed the sentiment distribution, ranked hotels by several criteria, and explored how review sentiment relates to the number of reviews per hotel. The findings can help in understanding customer perceptions and in making informed decisions in the accommodation industry. Based on these comparisons, there appears to be no clear relationship between the number of reviews and the average rating, although no formal correlation test was run. I had expected that more reviews would go along with better ratings and higher visibility, but that does not always seem to be the case. Raw word frequencies likewise do not appear to track the average rating. Relying solely on the number of reviews as a criterion for evaluating hotels therefore seems insufficient.