This project analyzes 515,000 hotel reviews to investigate the influence of seasonal cycles on the European traveler experience.
Research Goal: To quantify seasonal effects on sentiment/complaints and compare patterns across European regions.
Finding 1 (Sentiment Cycle): Sentiment scores for major cities follow a distinct seasonal cycle, peaking in Spring and falling to their lowest point in Autumn, suggesting city-specific strain late in the tourist season.
Finding 2 (Operational Shift): Operational pressure shifts systematically with the calendar: noise complaints surge in Summer, while temperature/facility issues dominate in Winter.
Finding 3 (Regions): Northern/Central Europe show high stability; Southern Europe exhibits greater seasonal rating volatility.
Conclusion: Seasonality is a critical factor influencing both the emotional quality and practical challenges of the tourist experience.
This project is guided by two core research questions aimed at understanding the systematic impact of seasonal cycles on the quality and nature of the European hotel guest experience.
How do seasonal changes affect hotel review sentiments and complaint types, and what “seasonal experience identities” emerge by city? (Addressed by Figure 1 and Figure 2)
Which forms of dissatisfaction (noise, cleanliness, service, facility issues) vary systematically across northern, southern, and central Europe? (Addressed by Figure 2 and Figure 3)
The analysis uses the 515K European Hotel Reviews dataset.
Scope: Contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.
Suitability: Ideal for seasonal study due to the inclusion of Review Date and rich review text fields.
Source: Data was scraped from Booking.com (publicly available).
# 1. Load Libraries
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(forcats)
library(scales)
theme_set(theme_minimal())
set.seed(1234)
# Create images folder if it doesn't exist (prevents ggsave errors)
if(!dir.exists("images")) dir.create("images")
# 2. Load Data
hotel_raw <- readr::read_csv("Hotel-Reviews.csv")
glimpse(hotel_raw)
## Rows: 515,738
## Columns: 17
## $ Hotel_Address <chr> "s Gravesandestraat 55 Oost…
## $ Additional_Number_of_Scoring <dbl> 194, 194, 194, 194, 194, 19…
## $ Review_Date <chr> "8/3/2017", "8/3/2017", "7/…
## $ Average_Score <dbl> 7.7, 7.7, 7.7, 7.7, 7.7, 7.…
## $ Hotel_Name <chr> "Hotel Arena", "Hotel Arena…
## $ Reviewer_Nationality <chr> "Russia", "Ireland", "Austr…
## $ Negative_Review <chr> "I am so angry that i made …
## $ Review_Total_Negative_Word_Counts <dbl> 397, 0, 42, 210, 140, 17, 3…
## $ Total_Number_of_Reviews <dbl> 1403, 1403, 1403, 1403, 140…
## $ Positive_Review <chr> "Only the park outside of t…
## $ Review_Total_Positive_Word_Counts <dbl> 11, 105, 21, 26, 8, 20, 18,…
## $ Total_Number_of_Reviews_Reviewer_Has_Given <dbl> 7, 7, 9, 1, 3, 1, 6, 1, 3, …
## $ Reviewer_Score <dbl> 2.9, 7.5, 7.1, 3.8, 6.7, 6.…
## $ Tags <chr> "[' Leisure trip ', ' Coupl…
## $ days_since_review <chr> "0 days", "0 days", "3 days…
## $ lat <dbl> 52.36058, 52.36058, 52.3605…
## $ lng <dbl> 4.915968, 4.915968, 4.91596…
# Define Region Vectors
northern <- c("United Kingdom", "Ireland", "Norway", "Sweden", "Denmark", "Finland")
southern <- c("Spain", "Italy", "Portugal", "Greece")
central <- c("Germany", "France", "Austria", "Switzerland", "Belgium",
"Netherlands", "Czech Republic", "Poland")
# Clean and derive key fields
hotel_clean <- hotel_raw %>%
  filter(!is.na(Review_Date), !is.na(Reviewer_Score)) %>%
  mutate(
    # Addresses end with the country name; "United Kingdom" is two words
    country = if_else(str_detect(Hotel_Address, "United Kingdom"),
                      "United Kingdom",
                      word(Hotel_Address, -1)),
    # UK hotels are treated as London; elsewhere the city precedes the country
    city = case_when(
      country == "United Kingdom" ~ "London",
      TRUE ~ str_remove(word(Hotel_Address, -2), ",")
    ),
    date = as.Date(Review_Date, format = "%m/%d/%Y"),
    rating = Reviewer_Score,
    # paste() always inserts a space, so squish before testing for empty text
    text = str_squish(paste(Positive_Review, Negative_Review))
  ) %>%
  filter(text != "") %>%
  mutate(review_id = row_number()) %>%
  select(review_id, city, country, date, rating, text) %>%
  # Create Season Variables
  mutate(
    month = month(date, label = TRUE, abbr = TRUE, locale = "C"),
    season = case_when(
      month(date) %in% c(12, 1, 2) ~ "Winter",
      month(date) %in% c(3, 4, 5) ~ "Spring",
      month(date) %in% c(6, 7, 8) ~ "Summer",
      month(date) %in% c(9, 10, 11) ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn")),
    # Create Region Variable
    region = case_when(
      country %in% northern ~ "Northern Europe",
      country %in% southern ~ "Southern Europe",
      country %in% central ~ "Central Europe",
      TRUE ~ "Other"
    ),
    region = factor(region,
                    levels = c("Northern Europe", "Central Europe",
                               "Southern Europe", "Other"))
  )
print(unique(hotel_clean$city))
## [1] "Amsterdam" "London" "Paris" "Barcelona" "Milan" "Vienna"
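One caveat about the `text` field built above: the commonly distributed version of this dataset appears to store a blank review half as the literal string "No Positive" or "No Negative" rather than as an empty string. If that holds for `Hotel-Reviews.csv`, those placeholders reach the tokenizer, and the words "positive" and "negative" may then be scored by the lexicon. The sketch below shows one way to blank them first. `strip_placeholder()` is a hypothetical helper, the demo rows are invented, and the placeholder strings are an assumption to verify against the raw file.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: replace an assumed placeholder string with "".
strip_placeholder <- function(x, placeholder) {
  if_else(str_squish(x) == placeholder, "", x)
}

# Invented demo rows, for illustration only.
demo <- tibble(
  Positive_Review = c("Great location", "No Positive"),
  Negative_Review = c("No Negative", "Room was noisy")
)

demo_clean <- demo %>%
  mutate(
    pos  = strip_placeholder(Positive_Review, "No Positive"),
    neg  = strip_placeholder(Negative_Review, "No Negative"),
    text = str_squish(paste(pos, neg))   # drop the stray space left by paste()
  )

demo_clean$text
```

Applying such a step would change the sentiment figures reported in this document, so it is best read as a possible robustness check rather than a description of the pipeline actually used.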
# 1. Tokenize (Unnest words)
tokens_for_sentiment <- hotel_clean %>%
select(review_id, text) %>%
unnest_tokens(word, text)
# 2. Calculate Sentiment Scores
bing_lex <- get_sentiments("bing")
review_sentiment <- tokens_for_sentiment %>%
  # A few bing words are listed under both polarities, so this join is
  # many-to-many by design; declaring it avoids the dplyr warning.
  inner_join(bing_lex, by = "word", relationship = "many-to-many") %>%
  count(review_id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n,
                     values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)
# 3. Merge back to main dataset
hotel_sent <- hotel_clean %>%
left_join(review_sentiment,
by = "review_id") %>%
mutate(
positive = replace_na(positive, 0),
negative = replace_na(negative, 0),
sentiment_score = replace_na(sentiment_score, 0)
)
hotel_sent %>%
select(review_id, city, date, rating, positive, negative, sentiment_score) %>%
slice_head(n=5)
## # A tibble: 5 × 7
## review_id city date rating positive negative sentiment_score
## <int> <chr> <date> <dbl> <int> <int> <int>
## 1 1 Amsterdam 2017-08-03 2.9 5 13 -8
## 2 2 Amsterdam 2017-08-03 7.5 8 5 3
## 3 3 Amsterdam 2017-07-31 7.1 4 1 3
## 4 4 Amsterdam 2017-07-31 3.8 11 15 -4
## 5 5 Amsterdam 2017-07-24 6.7 5 3 2
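To make the scoring rule concrete, the sketch below repeats the same steps (tokenize, join to a lexicon, count by polarity, subtract) on two invented reviews. The toy lexicon `toy_lex` and the review texts are hypothetical stand-ins, not entries taken from the bing lexicon or the dataset.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented reviews and an invented four-word lexicon, for illustration only.
toy_reviews <- tibble(
  review_id = 1:2,
  text = c("Lovely room but noisy street", "Dirty bathroom and rude staff")
)
toy_lex <- tibble(
  word      = c("lovely", "noisy", "dirty", "rude"),
  sentiment = c("positive", "negative", "negative", "negative")
)

toy_scores <- toy_reviews %>%
  unnest_tokens(word, text) %>%          # one lowercased token per row
  inner_join(toy_lex, by = "word") %>%   # keep only words found in the lexicon
  count(review_id, sentiment) %>%        # tokens per review and polarity
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

toy_scores
```

Review 1 has one positive and one negative word, so it scores 0; review 2 has two negative words, so it scores -2. Note that `positive - negative` relies on both columns existing after `pivot_wider()`; on a sample with no positive matches at all the `positive` column would be missing and the `mutate()` would fail.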
# Table 1: Overall Summary Statistics
table1 <- tibble(
Variable = c("Rating", "Positive", "Negative", "Sentiment Score"),
Mean = c(
mean(hotel_sent$rating, na.rm = TRUE),
mean(hotel_sent$positive, na.rm = TRUE),
mean(hotel_sent$negative, na.rm = TRUE),
mean(hotel_sent$sentiment_score, na.rm = TRUE)
),
SD = c(
sd(hotel_sent$rating, na.rm = TRUE),
sd(hotel_sent$positive, na.rm = TRUE),
sd(hotel_sent$negative, na.rm = TRUE),
sd(hotel_sent$sentiment_score, na.rm = TRUE)
),
Min = c(
min(hotel_sent$rating, na.rm = TRUE),
min(hotel_sent$positive, na.rm = TRUE),
min(hotel_sent$negative, na.rm = TRUE),
min(hotel_sent$sentiment_score, na.rm = TRUE)
),
Max = c(
max(hotel_sent$rating, na.rm = TRUE),
max(hotel_sent$positive, na.rm = TRUE),
max(hotel_sent$negative, na.rm = TRUE),
max(hotel_sent$sentiment_score, na.rm = TRUE)
)
)
library(gt)
## Warning: package 'gt' was built under R version 4.4.3
season_summary <- hotel_sent %>%
filter(!is.na(season)) %>%
group_by(season) %>%
summarise(
average_rating = round(mean(rating, na.rm = TRUE), 2),
average_sentiment = round(mean(sentiment_score, na.rm = TRUE), 2),
review_count = n(),
.groups = "drop"
)
season_summary %>%
gt() %>%
tab_header(
title = "Summary of Hotel Reviews by Season",
subtitle = "Average rating, sentiment score, and review count"
) %>%
cols_label(
season = "Season",
average_rating = "Avg Rating",
average_sentiment = "Avg Sentiment",
review_count = "Review Count"
)
Summary of Hotel Reviews by Season (average rating, sentiment score, and review count)

| Season | Avg Rating | Avg Sentiment | Review Count |
|---|---|---|---|
| Winter | 8.48 | 2.06 | 120128 |
| Spring | 8.43 | 2.21 | 130483 |
| Summer | 8.38 | 2.19 | 142886 |
| Autumn | 8.29 | 1.92 | 122241 |
# Table 2: Review Count by Season and Region
table2 <- hotel_sent %>%
count(season, region) %>%
pivot_wider(names_from = region, values_from = n, values_fill = 0) %>%
rename(Season = season)
table2 %>%
gt() %>%
tab_header(
title = "Table 2: Review Count by Season and Region",
subtitle = "Number of reviews across seasons and European regions"
) %>%
cols_label(
`Northern Europe` = "Northern",
`Central Europe` = "Central",
`Southern Europe` = "Southern"
) %>%
fmt_number(
columns = where(is.numeric),
decimals = 0, # counts are integers; the default of 2 decimals prints "67,346.00"
use_seps = TRUE
) %>%
tab_options(
table.align = "center"
)
Table 2: Review Count by Season and Region (number of reviews across seasons and European regions)

| Season | Northern | Central | Southern |
|---|---|---|---|
| Winter | 67,346 | 35,146 | 17,636 |
| Spring | 67,420 | 38,369 | 24,694 |
| Summer | 66,961 | 45,112 | 30,813 |
| Autumn | 60,574 | 37,454 | 24,213 |
Sentiment Cycle: All four cities exhibit a clear annual sentiment cycle, peaking in Spring (March–May) and reaching its low point in late Autumn (October–November).
Peak Strain (Paris): Paris shows the most dramatic drop, hitting the lowest sentiment score among the group in October. This suggests high visitor stress during the late tourist season.
Relative Stability: Barcelona and Vienna generally maintain the highest mean sentiment scores throughout the year, suggesting comparative performance stability.
top_cities <- c("Vienna", "Paris", "Barcelona", "Amsterdam")
fig1_data <- hotel_sent %>%
filter(city %in% top_cities) %>%
group_by(city, month) %>%
summarise(
mean_sentiment = mean(sentiment_score, na.rm = TRUE),
n = n(),
.groups = "drop"
)
figure1 <- ggplot(fig1_data,
aes(x = month, y = mean_sentiment,
group = city, color = city)) +
geom_line(linewidth = 1) + # size is deprecated, updated to linewidth
geom_point(size = 2) +
labs(
title = "Monthly sentiment trends in selected European cities",
x = "Month",
y = "Average sentiment score",
color = "City"
) +
theme(panel.grid.minor = element_blank())
figure1
ggsave(figure1, filename = "images/figure1.png",
width = 6, height = 4, units = "in", bg = "transparent")
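The peak and trough months described for Figure 1 can also be read off programmatically instead of from the chart. The sketch below does this on a small invented table, `toy_fig1`, that has the same columns as `fig1_data` (`city`, `month`, `mean_sentiment`); the city names and values are hypothetical.

```r
library(dplyr)

# Invented monthly means standing in for fig1_data (values are made up).
toy_fig1 <- tibble(
  city  = rep(c("CityA", "CityB"), each = 4),
  month = factor(rep(c("Jan", "Apr", "Jul", "Oct"), times = 2),
                 levels = c("Jan", "Apr", "Jul", "Oct")),
  mean_sentiment = c(2.0, 2.4, 2.2, 1.8,
                     2.1, 2.3, 2.5, 1.9)
)

peak_trough <- toy_fig1 %>%
  group_by(city) %>%
  summarise(
    peak_month   = as.character(month[which.max(mean_sentiment)]),
    trough_month = as.character(month[which.min(mean_sentiment)]),
    amplitude    = max(mean_sentiment) - min(mean_sentiment),  # size of the cycle
    .groups = "drop"
  )

peak_trough
```

Running the same `summarise()` on `fig1_data` would give one row per city, which makes statements such as "lowest in October" directly checkable. The `n` column already computed in `fig1_data` is worth inspecting alongside it, since months with few reviews produce noisier means.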
This chart shows how operational pressure systematically shifts throughout the year across the four selected European cities.
Summer Pressure (Noise): Noise complaints surge significantly in Summer and Autumn, especially in highly touristic cities like Barcelona. This reflects the strain of peak tourist activity.
Winter Pressure (Facilities): Temperature / Facility issues show the inverse pattern, rising sharply in Winter, particularly in cooler-climate cities such as Amsterdam. This points to heating and building maintenance as the primary winter challenge.
Service and Cleanliness complaints remain relatively consistent across all seasons, suggesting these issues are linked more to hotel standards than to external seasonal factors.
tokens_complaint <- hotel_clean %>%
select(city, country, region, date, month, season, rating, text) %>%
unnest_tokens(word, text)
# 1. Define Dictionary
complaint_dict <- tibble(
keyword = c("noisy","loud","noise",
"dirty","smell","stain",
"rude","unhelpful","slow",
"cold","hot","ac","aircon","heating"),
type = c(rep("Noise",3),
rep("Cleanliness",3),
rep("Service",3),
rep("Temperature / Facility",5))
)
# 2. Prepare Data for Plotting
fig2_data_optimized <- hotel_clean %>%
filter(city %in% top_cities, !is.na(season)) %>%
select(city, season, text) %>%
unnest_tokens(word, text) %>%
mutate(word = str_to_lower(word)) %>%
inner_join(complaint_dict, by = c("word" = "keyword")) %>%
count(city, season, type) %>%
group_by(city, season) %>%
mutate(pct = n / sum(n)) %>%
ungroup() %>%
mutate(
city = factor(city, levels = top_cities),
type = fct_relevel(type, "Noise", "Cleanliness", "Service", "Temperature / Facility")
)
# 3. Plotting (scales is already attached in step 1)
figure2_stacked <- ggplot(fig2_data_optimized,
aes(x = season, y = pct, fill = type)) +
geom_col(position = "stack", width = 0.7) +
facet_wrap(~ city, nrow = 1) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Figure 2: Seasonal Composition of Complaint Types",
subtitle = paste("Analysis of complaint category proportions in:", paste(top_cities, collapse=", ")),
x = "",
y = "Share of Complaints",
fill = "Complaint Type"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major.x = element_blank()
)
figure2_stacked
ggsave(figure2_stacked, filename = "images/figure2_stacked.png",
width = 10, height = 5, units = "in", bg = "white")
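Whether the complaint mix really differs by season, rather than just looking different in a stacked bar, can be checked with a chi-squared test of independence on the season-by-type counts. The sketch below uses base R's `chisq.test()` on an invented count table for a single city; the numbers in `toy_counts` are hypothetical and are not results from this dataset.

```r
# Invented keyword counts for one city (rows = season, columns = complaint type).
toy_counts <- matrix(
  c(120,  90, 80, 210,
    150,  95, 85, 120,
    260, 100, 90, 110,
    200,  92, 88, 130),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    season = c("Winter", "Spring", "Summer", "Autumn"),
    type   = c("Noise", "Cleanliness", "Service", "Temperature / Facility")
  )
)

# Null hypothesis: complaint type is independent of season.
chi <- chisq.test(toy_counts)

chi$parameter          # degrees of freedom: (4 - 1) * (4 - 1)
chi$p.value            # small value = composition differs across seasons
round(chi$stdres, 1)   # standardized residuals: which cells drive the result
```

On the real data, the equivalent table could be built from the `n` column of `fig2_data_optimized` for one city at a time. One caution: keyword hits from the same review are not independent observations, so p-values from this test are likely to be optimistic and the residuals are best used descriptively.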
This boxplot compares how different European regions handle seasonal pressure.
Northern & Central Resilience: These regions demonstrate high stability. Their rating distributions remain consistent year-round, suggesting they are operationally resilient to seasonal changes.
Southern Volatility: In contrast, Southern Europe exhibits greater variability, particularly in Summer. While the median rating stays high, the distribution’s lower tail drops, indicating a higher likelihood of poor traveler experiences during the peak season.
fig3_data <- hotel_sent %>%
filter(!is.na(season), region != "Other")
figure3 <- ggplot(fig3_data,
aes(x = season, y = rating, fill = region)) +
geom_boxplot(outlier.alpha = 0.15) +
scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Distribution of Hotel Ratings by Season and Region",
x = "Season",
y = "Hotel Rating",
fill = "Region"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "bottom"
)
figure3
ggsave(figure3, filename = "images/figure3.png",
width = 6, height = 4, units = "in", bg = "transparent")
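The volatility comparison read from the boxplot can be expressed as numbers: the median, the interquartile range, and a lower-tail percentile for each region and season. The sketch below computes these on a small invented table, `toy_ratings`; the ratings are hypothetical and chosen so that one group has the same median as the others but a heavier lower tail.

```r
library(dplyr)

# Invented ratings: identical groups except one low Southern summer score.
toy_ratings <- tibble(
  region = rep(c("Northern Europe", "Southern Europe"), each = 8),
  season = rep(rep(c("Winter", "Summer"), each = 4), times = 2),
  rating = c(8, 8.5, 9, 9.5,   8, 8.5, 9, 9.5,
             8, 8.5, 9, 9.5,   5, 8.5, 9, 9.5)
)

spread <- toy_ratings %>%
  group_by(region, season) %>%
  summarise(
    median_rating = median(rating),
    iqr_rating    = IQR(rating),                           # width of the box
    p10_rating    = quantile(rating, 0.10, names = FALSE), # lower tail
    .groups = "drop"
  )

spread
```

In this toy table all four medians are equal while the Southern summer group has a wider IQR and a lower 10th percentile, which is the pattern the text describes as a dropping lower tail. The same `summarise()` applied to `fig3_data` would put figures behind the stability and volatility claims.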
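This analysis confirms that seasonality is a critical, predictable factor in the European hotel experience: across all reviews, mean sentiment is highest in Spring (2.21) and lowest in Autumn (1.92), and the mean rating slips from 8.48 in Winter to 8.29 in Autumn.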
Operational Shifts: The main challenges for hotels shift with the calendar: Summer means dealing with noise, and Winter means fixing facilities and heating.
Regional Strength: Northern and Central Europe demonstrate the highest stability and service resilience throughout the year.
Actionable Insight: Hotels must proactively adjust their focus ahead of each season, for example by improving soundproofing before Summer and performing maintenance before Winter, to manage these predictable seasonal pressures.