In this project, I analyzed hotel reviews in Europe to gain insights into customer sentiments and identify the top hotels based on positive and negative word frequencies. I also examined the relationship between review sentiments and the number of reviews per hotel. The main question of my final project is, “Does the number of reviews and the frequency of positive/negative words impact the average rating of a hotel?”
The dataset used in this analysis is “Hotel_Reviews.csv,” which contains hotel reviews from various European hotels. The dataset includes information such as the hotel name, review scores, positive and negative reviews, and other relevant variables from Booking.com.
First, I loaded the dataset using the read_csv() function and stored it in the data variable. Then, I performed necessary data cleaning steps to remove irrelevant variables and handle missing data.
To analyze the text data in the reviews, I utilized several libraries, including tidytext, dplyr, wordcloud, and ggplot2. I started by loading stop words to remove common words that do not carry significant meaning.
Next, I combined the positive and negative reviews into a single text column and removed stop words from the combined reviews. Then, I calculated the count of each term in the cleaned reviews.
library(readr)  # read_csv() comes from readr
data <- read_csv("Hotel_Reviews.csv")
## Rows: 515738 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Hotel_Address, Review_Date, Hotel_Name, Reviewer_Nationality, Negat...
## dbl (9): Additional_Number_of_Scoring, Average_Score, Review_Total_Negative_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
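The cleaning step mentioned earlier is not shown in code. As a minimal sketch only, the following illustrates one way such a step could look on a small toy table. The choice of columns to keep, the rule for dropping rows, and the names reviews and cleaned are assumptions made for illustration, not the steps actually applied to the dataset.

```r
library(dplyr)

# Toy stand-in for the review data (hypothetical rows; the real file has 17 columns).
reviews <- tibble(
  Hotel_Name           = c("Hotel A", "Hotel A", "Hotel B"),
  Average_Score        = c(8.1, 8.1, NA),
  Positive_Review      = c("Great staff", "Lovely breakfast", "Clean room"),
  Negative_Review      = c("Small room", NA, "Noisy street"),
  Reviewer_Nationality = c("X", "Y", "Z")
)

cleaned <- reviews %>%
  # Keep only the variables used later (an assumed selection).
  select(Hotel_Name, Average_Score, Positive_Review, Negative_Review) %>%
  # Drop rows without a score, since they cannot enter a rating comparison.
  filter(!is.na(Average_Score)) %>%
  # Treat missing review text as an empty string instead of dropping the row.
  mutate(across(c(Positive_Review, Negative_Review), ~ coalesce(.x, "")))
```

Replacing missing text with an empty string keeps the review in the per-hotel review count while contributing no tokens to the word counts.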
To visually represent the most frequent terms in the reviews, I created a word cloud using the wordcloud() function. The word cloud shows the 50 most commonly used words, with larger words indicating higher frequency. Visualizing the most frequent words shows at a glance which aspects of a hotel stay customers mention most.
# Analysis and Figure 1: Word Cloud
library(tidytext)
library(dplyr)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
data(stop_words)
all_reviews <- paste(data$Positive_Review, data$Negative_Review, sep = " ")
clean_reviews <- all_reviews %>%
  tibble(text = .) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
term_counts <- clean_reviews %>%
  count(word) %>%
  arrange(desc(n))
wordcloud(term_counts$word, term_counts$n, scale = c(6, 0.5), max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
To further explore the most frequent terms in the reviews, I selected the top 20 terms by count, reordered the levels of the word factor, and created a bar chart of their frequencies. The chart indicates that customers pay a lot of attention to the staff. It also shows that the hotel’s location and breakfast, as well as facilities such as beds and bathrooms, are important to them.
# Analysis and Figure 2: Top 20 Terms Bar Chart
top_terms <- term_counts %>%
  top_n(20, n)
top_terms$word <- factor(top_terms$word, levels = rev(top_terms$word))
ggplot(top_terms, aes(x = word, y = n, fill = word)) +
  geom_col() +
  labs(x = "Term", y = "Count", title = "Top 20 Terms") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
To analyze the sentiment in the reviews, I processed the positive and negative review fields separately. For each field, I counted the words that match the AFINN lexicon. The code counts every matched word regardless of its AFINN score, so “positive words” and “negative words” here mean lexicon words found in the positive and negative review fields, respectively. Then, I created a bar plot to compare the two counts. Lexicon words appear more often in the positive fields than in the negative fields, which suggests an overall positive evaluation from customers.
# Analysis and Figure 3: Count of Positive and Negative Words
data <- mutate(data, Positive_Review = as.character(Positive_Review),
               Negative_Review = as.character(Negative_Review))
positive_word_freq <- data %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Sentiment = "Positive") %>%
  rename(Count = n)
negative_word_freq <- data %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Sentiment = "Negative") %>%
  rename(Count = n)
word_freq <- bind_rows(positive_word_freq, negative_word_freq)
ggplot(word_freq, aes(x = Sentiment, y = Count, fill = Sentiment)) +
  geom_bar(stat = "identity") +
  labs(x = "Sentiment", y = "Count", fill = "Sentiment") +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red")) +
  ggtitle("Count of Positive and Negative Words")
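The code above labels a word by the review field it came from, not by its AFINN score. An alternative is to classify each matched word by the sign of its score. The sketch below shows this on toy data. The mini-lexicon toy_lexicon and the reviews in toy_reviews are made up for illustration, and the lexicon is only assumed to have the same shape as get_sentiments("afinn") (a word column and a numeric value column).

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon: `word` plus a signed `value` (negative = negative sentiment).
toy_lexicon <- tibble(
  word  = c("good", "great", "bad", "dirty"),
  value = c(3L, 3L, -3L, -2L)
)

# Made-up reviews; note the negative word inside a positive-field review.
toy_reviews <- tibble(
  Positive_Review = c("Great location and good breakfast", "Nothing bad to say"),
  Negative_Review = c("Dirty bathroom", "Bad wifi")
)

polarity_counts <- bind_rows(
  unnest_tokens(select(toy_reviews, text = Positive_Review), word, text),
  unnest_tokens(select(toy_reviews, text = Negative_Review), word, text)
) %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Classify by the sign of the lexicon score rather than by source field.
  mutate(Sentiment = if_else(value > 0, "Positive", "Negative")) %>%
  count(Sentiment, name = "Count")
```

With this approach the word “bad” in “Nothing bad to say” is counted as negative even though it appears in the positive field, so the two methods can give different totals.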
To examine the popularity of hotels based on the number of reviews, I counted the reviews per hotel, selected the 20 hotels with the highest review counts, and created a bar plot to visualize the distribution. The bar plot identifies the hotels with the most reviews. I then used these counts to investigate whether there is a meaningful relationship between the number of reviews and a hotel’s rating.
# Analysis and Figure 4: Number of Reviews per Hotel (Top 20)
df <- read.csv("Hotel_Reviews.csv", header = TRUE, stringsAsFactors = FALSE)
hotel_counts <- table(df$Hotel_Name)
top_hotels <- names(hotel_counts)[order(hotel_counts, decreasing = TRUE)][1:20]
df_top_20 <- df[df$Hotel_Name %in% top_hotels, ]
df_top_20$Hotel_Name <- factor(df_top_20$Hotel_Name, levels = rev(top_hotels))
ggplot(df_top_20, aes(x = Hotel_Name)) +
  geom_bar(fill = "blue") +
  labs(title = "Number of Reviews per Hotel (Top 20)",
       x = "Hotel Name",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) +
  coord_flip()
To identify the top hotels based on average review scores, I calculated the average score for each hotel and listed the 20 highest-scoring hotels. Compared with the previous graph, the most-reviewed hotels are not the highest-rated ones, so the number of reviews and customers’ average rating are not directly proportional.
# Analysis and Figure 5: Top 20 Hotels by Average Score
top_20_hotels <- df %>%
  group_by(Hotel_Name) %>%
  summarise(Average_Score = mean(Average_Score, na.rm = TRUE)) %>%
  arrange(desc(Average_Score)) %>%
  slice(1:20)
top_20_hotels
## # A tibble: 20 × 2
##    Hotel_Name                                   Average_Score
##    <chr>                                                <dbl>
##  1 Ritz Paris                                             9.8
##  2 41                                                     9.6
##  3 H tel de La Tamise Esprit de France                    9.6
##  4 H10 Casa Mimosa 4 Sup                                  9.6
##  5 Haymarket Hotel                                        9.6
##  6 Hotel Casa Camper                                      9.6
##  7 Hotel The Serras                                       9.6
##  8 Charlotte Street Hotel                                 9.5
##  9 Ham Yard Hotel                                         9.5
## 10 Hotel Sacher Wien                                      9.5
## 11 Hotel The Peninsula Paris                              9.5
## 12 Le Narcisse Blanc Spa                                  9.5
## 13 Mercer Hotel Barcelona                                 9.5
## 14 Milestone Hotel Kensington                             9.5
## 15 Palais Coburg Residenz                                 9.5
## 16 Taj 51 Buckingham Gate Suites and Residences           9.5
## 17 The Soho Hotel                                         9.5
## 18 Waldorf Astoria Amsterdam                              9.5
## 19 45 Park Lane Dorchester Collection                     9.4
## 20 Batty Langley s                                        9.4
Similarly, I identified the lowest-rated hotels by calculating the average score for each hotel and listing the 20 lowest-scoring hotels. Compared with the graph of review counts, customers’ average rating does not appear to be significantly influenced by the number of reviews.
# Analysis and Figure 6: Lowest 20 Hotels by Average Score
lowest_20_hotels <- df %>%
  group_by(Hotel_Name) %>%
  summarise(Average_Score = mean(Average_Score, na.rm = TRUE)) %>%
  arrange(Average_Score) %>%
  slice(1:20)
lowest_20_hotels
## # A tibble: 20 × 2
##    Hotel_Name                                   Average_Score
##    <chr>                                                <dbl>
##  1 Hotel Liberty                                          5.2
##  2 Hotel Cavendish                                        6.4
##  3 Savoy Hotel Amsterdam                                  6.4
##  4 Best Western Maitrise Hotel Edgware Road               6.6
##  5 The Tophams Hotel                                      6.6
##  6 Commodore Hotel                                        6.7
##  7 Ibis Styles Milano Palmanova                           6.7
##  8 Bloomsbury Palace Hotel                                6.8
##  9 Villa Eugenie                                          6.8
## 10 Gainsborough Hotel                                     6.9
## 11 Hallmark Hotel London Chigwell Prince Regent           6.9
## 12 Idea Hotel Milano San Siro                             6.9
## 13 Eurohotel Diagonal Port                                7
## 14 Gran Hotel Barcino                                     7
## 15 Henry VIII                                             7
## 16 Hotel Royal Elys es                                    7
## 17 IH Hotels Milano Lorenteggio                           7
## 18 London Elizabeth Hotel                                 7
## 19 NH Carlton Amsterdam                                   7
## 20 Park Lane Mews Hotel                                   7
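Comparing two top-20 lists by eye only hints at a relationship. A more direct check is to compute a correlation between review count and average score across all hotels. The sketch below does this on a toy table. The data in toy and the names hotel_summary, n_reviews, and rho are hypothetical, and Average_Score is assumed to be constant within each hotel.

```r
library(dplyr)

# Toy stand-in with one row per review (made-up values).
toy <- tibble(
  Hotel_Name    = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "D"),
  Average_Score = c(8.0, 8.0, 8.0, 8.0, 9.2, 9.2, 7.5, 7.5, 7.5, 8.8)
)

# One row per hotel: number of reviews and the hotel's average score.
hotel_summary <- toy %>%
  group_by(Hotel_Name) %>%
  summarise(
    n_reviews     = n(),
    Average_Score = first(Average_Score),
    .groups       = "drop"
  )

# Spearman's rank correlation does not assume a linear relationship.
rho <- cor(hotel_summary$n_reviews, hotel_summary$Average_Score,
           method = "spearman")
```

On the real data, cor.test() with the same arguments would additionally report a p-value, which could support or weaken the visual impression from the lists.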
To compare positive and negative sentiment for each hotel, I calculated the positive and negative word frequencies for the 20 hotels with the most reviews, subtracted the negative frequency from the positive frequency, and created a bar plot of the differences. Most of these hotels had more positive words than negative words, but there were cases like ‘Britannia International Hotel Canary Wharf’ where the opposite was true (7,475 positive against 9,048 negative), and cases where the difference was minimal, such as Hilton London Metropole (4,368 against 4,412). Overall, hotels with a high number of reviews tended to have more positive evaluations, but this cannot be generalized.
# Analysis and Figure 7: Positive Frequency - Negative Frequency
hotel_counts <- table(df$Hotel_Name)
top_hotels <- names(hotel_counts)[order(hotel_counts, decreasing = TRUE)][1:20]
df_top_20 <- df[df$Hotel_Name %in% top_hotels, ]
positive_word_freq <- df_top_20 %>%
  mutate(Positive_Review = as.character(Positive_Review)) %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(Hotel_Name, Review_Date) %>%
  summarise(Positive_Frequency = n())
## `summarise()` has grouped output by 'Hotel_Name'. You can override using the
## `.groups` argument.
negative_word_freq <- df_top_20 %>%
  mutate(Negative_Review = as.character(Negative_Review)) %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(Hotel_Name, Review_Date) %>%
  summarise(Negative_Frequency = n())
## `summarise()` has grouped output by 'Hotel_Name'. You can override using the
## `.groups` argument.
word_freq <- left_join(positive_word_freq, negative_word_freq, by = c("Hotel_Name", "Review_Date"))
summary <- word_freq %>%
  group_by(Hotel_Name) %>%
  summarise(
    Positive_Frequency = sum(Positive_Frequency, na.rm = TRUE),
    Negative_Frequency = sum(Negative_Frequency, na.rm = TRUE),
    Total_Frequency = Positive_Frequency + Negative_Frequency
  ) %>%
  arrange(desc(Total_Frequency))
summary
## # A tibble: 20 × 4
##    Hotel_Name              Positive_Frequency Negative_Frequency Total_Frequency
##    <chr>                                <int>              <int>           <int>
##  1 Britannia Internationa…               7475               9048           16523
##  2 Park Plaza Westminster…               8782               6611           15393
##  3 Strand Palace Hotel                   7749               5914           13663
##  4 Copthorne Tara Hotel L…               6756               5371           12127
##  5 DoubleTree by Hilton H…               7014               4595           11609
##  6 Grand Royale London Hy…               6050               4140           10190
##  7 Holiday Inn London Ken…               5134               4705            9839
##  8 Intercontinental Londo…               6253               3322            9575
##  9 Hilton London Metropole               4368               4412            8780
## 10 Millennium Gloucester …               4377               4325            8702
## 11 DoubleTree by Hilton L…               4914               3701            8615
## 12 Park Grand Paddington …               5147               3152            8299
## 13 Park Plaza County Hall…               4717               3474            8191
## 14 M by Montcalm Shoredit…               4961               3214            8175
## 15 Blakemore Hyde Park                   4883               3171            8054
## 16 Hotel Da Vinci                        4459               3536            7995
## 17 Park Plaza London Rive…               4471               3504            7975
## 18 Park Grand London Kens…               4717               3171            7888
## 19 Hilton London Wembley                 4413               3347            7760
## 20 St James Court A Taj H…               4617               2922            7539
summary <- summary[order(summary$Positive_Frequency - summary$Negative_Frequency, decreasing = TRUE), ]
ggplot(summary, aes(x = reorder(Hotel_Name, Positive_Frequency - Negative_Frequency),
                    y = Positive_Frequency - Negative_Frequency)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Positive Frequency - Negative Frequency",
       x = "Hotel Name",
       y = "Positive Frequency - Negative Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm")) +
  coord_flip()
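One detail of the join used for this figure is worth checking. left_join() keeps only the hotel and date groups present in the positive table, so a group with negative lexicon words but no positive ones would be dropped from the totals. Whether such groups occur in the real data is not established here. The toy sketch below, with made-up tables pos and neg, only illustrates how left_join() and full_join() can differ.

```r
library(dplyr)

# Hypothetical per-date counts for one hotel.
pos <- tibble(
  Hotel_Name         = c("A", "A"),
  Review_Date        = c("d1", "d2"),
  Positive_Frequency = c(5L, 3L)
)
neg <- tibble(
  Hotel_Name         = c("A", "A"),
  Review_Date        = c("d2", "d3"),   # "d3" has no match in `pos`
  Negative_Frequency = c(2L, 4L)
)

keys <- c("Hotel_Name", "Review_Date")

# left_join() silently drops the "d3" negative count.
left_total <- left_join(pos, neg, by = keys) %>%
  summarise(Negative_Frequency = sum(Negative_Frequency, na.rm = TRUE))

# full_join() keeps unmatched rows from both sides.
full_total <- full_join(pos, neg, by = keys) %>%
  summarise(Negative_Frequency = sum(Negative_Frequency, na.rm = TRUE))
```

Grouping by Hotel_Name alone before joining would avoid the issue as well, since every hotel in the top 20 is likely to appear in both tables.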
I also looked at the frequency of positive words in the reviews for each hotel. I calculated the frequency of positive words across all hotels, selected the top 20, and created a bar plot of the hotels with the highest positive word frequencies. Because these are raw counts, they grow with the number of reviews a hotel receives. Compared to the earlier results, a high frequency of positive words does not necessarily mean a higher rating according to customers’ evaluations.
# Analysis and Figure 8: Top 20 Hotels with the Highest Frequency of Positive Words
df_clean <- df %>%
  mutate(Positive_Review = as.character(Positive_Review)) %>%
  unnest_tokens(output = "word", input = Positive_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Hotel_Name, sort = TRUE) # Count positive words per hotel
top_hotels <- df_clean %>%
  top_n(20)
## Selecting by n
library(stringr)
ggplot(top_hotels, aes(x = reorder(Hotel_Name, n), y = n)) +
  geom_bar(fill = "blue", stat = "identity") +
  labs(title = str_wrap("Top 20 Hotels with the Highest Frequency of Positive Words", width = 40),
       x = "Hotel Name",
       y = "Positive Word Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 12, vjust = 1),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm")) +
  coord_flip()
Similarly, I analyzed the frequency of negative words in the reviews for each hotel. I calculated the frequency of negative words across all hotels, selected the top 20, and created a bar plot of the hotels with the highest negative word frequencies. Compared to the earlier results, a high frequency of negative words does not necessarily mean a lower average rating according to customers’ evaluations.
# Analysis and Figure 9: Top 20 Hotels with the Highest Frequency of Negative Words
df_clean_negative <- df %>%
  mutate(Negative_Review = as.character(Negative_Review)) %>%
  unnest_tokens(output = "word", input = Negative_Review) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  count(Hotel_Name, sort = TRUE) # Count negative words per hotel
top_hotels_negative <- df_clean_negative %>%
  top_n(20)
## Selecting by n
ggplot(top_hotels_negative, aes(x = reorder(Hotel_Name, n), y = n)) +
  geom_bar(fill = "red", stat = "identity") +
  labs(title = str_wrap("Top 20 Hotels with the Highest Frequency of Negative Words", width = 40),
       x = "Hotel Name",
       y = "Negative Word Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 12, vjust = 1),
        plot.margin = margin(1.5, 1.5, 1.5, 1.5, "cm")) +
  coord_flip()
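Raw word counts per hotel are tied to how many reviews a hotel has, so the two rankings above partly reflect review volume. One possible adjustment is to divide each hotel’s word count by its number of reviews. The sketch below shows the idea for the negative field on toy data. The lexicon toy_lexicon, the reviews in toy, and the column names n_words and words_per_review are all hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon shaped like get_sentiments("afinn").
toy_lexicon <- tibble(word = c("bad", "dirty", "noisy"),
                      value = c(-3L, -2L, -1L))

# Made-up reviews: "Big" has four reviews, "Small" has one.
toy <- tibble(
  Hotel_Name      = c("Big", "Big", "Big", "Big", "Small"),
  Negative_Review = c("bad wifi", "noisy", "nothing", "dirty carpet",
                      "bad dirty noisy room")
)

reviews_per_hotel <- count(toy, Hotel_Name, name = "n_reviews")

negative_rate <- toy %>%
  unnest_tokens(word, Negative_Review) %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(Hotel_Name, name = "n_words") %>%
  left_join(reviews_per_hotel, by = "Hotel_Name") %>%
  # Lexicon words per review, so hotels of different sizes are comparable.
  mutate(words_per_review = n_words / n_reviews) %>%
  arrange(desc(words_per_review))
```

In this toy example both hotels have the same raw count of lexicon words, but the per-review rate separates them, which is the kind of difference a raw-count ranking hides.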
In conclusion, this project provides insights into hotel reviews in Europe. I analyzed the sentiment distribution, identified the top hotels based on various criteria, and explored the relationship between review sentiments and the number of reviews per hotel. The findings can be used to understand customer perceptions and make informed decisions in the accommodation industry.
After these comparisons, there appears to be no clear relationship between the number of reviews and the average rating based on customers’ evaluations. I had also speculated that more reviews would lead to better ratings and higher visibility, but that does not always seem to be the case. The frequency of positive or negative words also does not appear to have a significant impact on the average rating. Relying solely on the number of reviews as a criterion for evaluating hotels therefore appears insufficient.