Executive summary

This project analyzes 515,000 hotel reviews to investigate the influence of seasonal cycles on the European traveler experience.

Research Goal: To quantify seasonal effects on sentiment/complaints and compare patterns across European regions.

Finding 1 (Sentiment Cycle): Sentiment scores for major cities follow a distinct seasonal cycle, peaking in spring and falling to their lowest point in autumn (October–November), which suggests city-specific strain late in the tourist season.

Finding 2 (Operational Shift): Operational pressure shifts systematically with the calendar: noise complaints surge in summer, while temperature and facility issues dominate in winter.

Finding 3 (Regions): Northern/Central Europe show high stability; Southern Europe exhibits greater seasonal rating volatility.

Conclusion: Seasonality is a critical factor influencing both the emotional quality and practical challenges of the tourist experience.

Research Questions

This project is guided by two core research questions aimed at understanding the systematic impact of seasonal cycles on the quality and nature of the European hotel guest experience.

How do seasonal changes affect hotel review sentiments and complaint types, and what “seasonal experience identities” emerge by city? (Addressed by Figure 1 and Figure 2)
Which forms of dissatisfaction (noise, cleanliness, service, facility issues) vary systematically across northern, southern, and central Europe? (Addressed by Figure 2 and Figure 3)

Data background

The analysis uses the 515K European Hotel Reviews dataset.

Scope: Contains 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.

Suitability: Ideal for seasonal study due to the inclusion of Review Date and rich review text fields.

Source: Data was scraped from Booking.com (publicly available).

Data cleaning

# 1. Load Libraries
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(forcats)
library(scales) 

theme_set(theme_minimal())
set.seed(1234)

# Create images folder if it doesn't exist (prevents ggsave errors)
if(!dir.exists("images")) dir.create("images")

# 2. Load Data
hotel_raw <- readr::read_csv("Hotel-Reviews.csv")
glimpse(hotel_raw)

## Rows: 515,738
## Columns: 17
## $ Hotel_Address                              <chr> "s Gravesandestraat 55 Oost…
## $ Additional_Number_of_Scoring               <dbl> 194, 194, 194, 194, 194, 19…
## $ Review_Date                                <chr> "8/3/2017", "8/3/2017", "7/…
## $ Average_Score                              <dbl> 7.7, 7.7, 7.7, 7.7, 7.7, 7.…
## $ Hotel_Name                                 <chr> "Hotel Arena", "Hotel Arena…
## $ Reviewer_Nationality                       <chr> "Russia", "Ireland", "Austr…
## $ Negative_Review                            <chr> "I am so angry that i made …
## $ Review_Total_Negative_Word_Counts          <dbl> 397, 0, 42, 210, 140, 17, 3…
## $ Total_Number_of_Reviews                    <dbl> 1403, 1403, 1403, 1403, 140…
## $ Positive_Review                            <chr> "Only the park outside of t…
## $ Review_Total_Positive_Word_Counts          <dbl> 11, 105, 21, 26, 8, 20, 18,…
## $ Total_Number_of_Reviews_Reviewer_Has_Given <dbl> 7, 7, 9, 1, 3, 1, 6, 1, 3, …
## $ Reviewer_Score                             <dbl> 2.9, 7.5, 7.1, 3.8, 6.7, 6.…
## $ Tags                                       <chr> "[' Leisure trip ', ' Coupl…
## $ days_since_review                          <chr> "0 days", "0 days", "3 days…
## $ lat                                        <dbl> 52.36058, 52.36058, 52.3605…
## $ lng                                        <dbl> 4.915968, 4.915968, 4.91596…

Preprocessing: Locations, Dates, and Regions

# Define Region Vectors
northern <- c("United Kingdom", "Ireland", "Norway", "Sweden", "Denmark", "Finland")
southern <- c("Spain", "Italy", "Portugal", "Greece")
central  <- c("Germany", "France", "Austria", "Switzerland", "Belgium",
              "Netherlands", "Czech Republic", "Poland")

# Clean and derive key fields
hotel_clean <- hotel_raw %>%
  filter(!is.na(Review_Date), !is.na(Reviewer_Score)) %>%
  mutate(
    # Addresses end with the country; "United Kingdom" is two words,
    # so it is matched explicitly before falling back to the last word.
    country = if_else(str_detect(Hotel_Address, "United Kingdom"),
                      "United Kingdom",
                      word(Hotel_Address, -1)),

    # UK hotels are assigned to London; elsewhere the city is the
    # second-to-last word of the address.
    city = case_when(
      country == "United Kingdom" ~ "London",
      TRUE ~ str_remove(word(Hotel_Address, -2), ",")
    ),

    date = as.Date(Review_Date, format = "%m/%d/%Y"),
    rating = Reviewer_Score,
    text = paste(Positive_Review, Negative_Review)
  ) %>%
  # paste() always inserts a space, so squish before testing for empty text
  filter(str_squish(text) != "") %>%
  mutate(review_id = row_number()) %>%
  select(review_id, city, country, date, rating, text) %>%
  mutate(
    # Season variables
    month = month(date, label = TRUE, abbr = TRUE, locale = "C"),
    season = case_when(
      month(date) %in% c(12, 1, 2)  ~ "Winter",
      month(date) %in% c(3, 4, 5)   ~ "Spring",
      month(date) %in% c(6, 7, 8)   ~ "Summer",
      month(date) %in% c(9, 10, 11) ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn")),

    # Region variable
    region = case_when(
      country %in% northern ~ "Northern Europe",
      country %in% southern ~ "Southern Europe",
      country %in% central  ~ "Central Europe",
      TRUE ~ "Other"
    ),
    region = factor(region,
                    levels = c("Northern Europe", "Central Europe",
                               "Southern Europe", "Other"))
  )

print(unique(hotel_clean$city))

## [1] "Amsterdam" "London"    "Paris"     "Barcelona" "Milan"     "Vienna"
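As a quick check of the parsing rules above, the sketch below applies the same `word()` logic to a few made-up addresses and maps sample dates to seasons. The street names are hypothetical; only the trailing "city country" pattern is taken from the code. The helper `season_of()` is also hypothetical and simply mirrors the `case_when()` mapping.

```r
library(stringr)

# Hypothetical addresses that follow the "<street> <city> <country>" pattern
addresses <- c(
  "1 Example Street Amsterdam Netherlands",
  "2 Sample Road London United Kingdom",
  "3 Rue Exemple Paris France"
)

# "United Kingdom" is two words, so it is matched explicitly;
# otherwise the last word of the address is taken as the country.
country <- ifelse(str_detect(addresses, "United Kingdom"),
                  "United Kingdom",
                  word(addresses, -1))

# UK hotels are assigned to London; otherwise the city is the
# second-to-last word of the address.
city <- ifelse(country == "United Kingdom",
               "London",
               str_remove(word(addresses, -2), ","))

data.frame(country, city)

# Hypothetical helper mirroring the month-to-season mapping
season_of <- function(d) {
  m <- as.integer(format(d, "%m"))
  c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
    "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")[m]
}

# First date is from the glimpse() output; the other two are made up
dates <- as.Date(c("8/3/2017", "12/25/2016", "4/1/2016"), format = "%m/%d/%Y")
season_of(dates)
```

Running the same check on a sample of real addresses would be a cheap way to confirm that no country falls into the "Other" region by accident.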

Text Processing and Sentiment Calculation

# 1. Tokenize (Unnest words)
tokens_for_sentiment <- hotel_clean %>%  
  select(review_id, text) %>%
  unnest_tokens(word, text)

# 2. Calculate Sentiment Scores
bing_lex <- get_sentiments("bing")

review_sentiment <- tokens_for_sentiment %>% 
  # A few words appear in the Bing lexicon under both sentiments, so the
  # join is many-to-many; declaring it avoids the dplyr warning.
  inner_join(bing_lex, by = "word", relationship = "many-to-many") %>%
  count(review_id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n,
                     values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

# 3. Merge back to main dataset
hotel_sent <- hotel_clean %>%
  left_join(review_sentiment,
            by = "review_id") %>%
  mutate(
    positive = replace_na(positive, 0),
    negative = replace_na(negative, 0),
    sentiment_score = replace_na(sentiment_score, 0)
  )

hotel_sent %>% 
  select(review_id, city, date, rating, positive, negative, sentiment_score) %>% 
  slice_head(n=5)

## # A tibble: 5 × 7
##   review_id city      date       rating positive negative sentiment_score
##       <int> <chr>     <date>      <dbl>    <int>    <int>           <int>
## 1         1 Amsterdam 2017-08-03    2.9        5       13              -8
## 2         2 Amsterdam 2017-08-03    7.5        8        5               3
## 3         3 Amsterdam 2017-07-31    7.1        4        1               3
## 4         4 Amsterdam 2017-07-31    3.8       11       15              -4
## 5         5 Amsterdam 2017-07-24    6.7        5        3               2
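One caveat with the combined `text` field: the dataset's public description states that a missing review half is stored as the literal string "No Positive" or "No Negative". If that holds for this file (an assumption worth checking with `count(hotel_raw, Negative_Review, sort = TRUE)`), those placeholders are tokenized into `positive` and `negative`, which may themselves be scored by a sentiment lexicon. A minimal sketch of blanking them before pasting, using a hypothetical helper `strip_placeholder()`:

```r
# Hypothetical helper: blank out placeholder strings. The exact strings
# "No Positive" / "No Negative" are an assumption about the raw data.
strip_placeholder <- function(x) {
  is_placeholder <- tolower(trimws(x)) %in% c("no positive", "no negative")
  ifelse(is_placeholder, "", x)
}

# Toy reviews (made up for illustration)
positive_review <- c("Great location", "No Positive", "Friendly staff")
negative_review <- c("No Negative", "Room was noisy", " No Negative ")

# Combine the two halves, dropping the space left by an empty half
text <- trimws(paste(strip_placeholder(positive_review),
                     strip_placeholder(negative_review)))
text
```

Applying such a step would change the sentiment scores reported here, so the tables and figures would need to be regenerated.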

Table 1: Overall Summary Statistics

table1 <- tibble(
  Variable = c("Rating", "Positive", "Negative", "Sentiment Score"),
  Mean = c(
    mean(hotel_sent$rating, na.rm = TRUE),
    mean(hotel_sent$positive, na.rm = TRUE),
    mean(hotel_sent$negative, na.rm = TRUE),
    mean(hotel_sent$sentiment_score, na.rm = TRUE)
  ),
  
  SD = c(
    sd(hotel_sent$rating, na.rm = TRUE),
    sd(hotel_sent$positive, na.rm = TRUE),
    sd(hotel_sent$negative, na.rm = TRUE),
    sd(hotel_sent$sentiment_score, na.rm = TRUE)
  ),
  
  Min = c(
    min(hotel_sent$rating, na.rm = TRUE),
    min(hotel_sent$positive, na.rm = TRUE),
    min(hotel_sent$negative, na.rm = TRUE),
    min(hotel_sent$sentiment_score, na.rm = TRUE)
  ),
  
  Max = c(
    max(hotel_sent$rating, na.rm = TRUE),
    max(hotel_sent$positive, na.rm = TRUE),
    max(hotel_sent$negative, na.rm = TRUE),
    max(hotel_sent$sentiment_score, na.rm = TRUE)
  )
)
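The table above is assembled column by column. An equivalent, more compact form is sketched below with `pivot_longer()` and a grouped `summarise()`. It runs on a small made-up data frame, `toy_sent`, standing in for `hotel_sent`, so the numbers are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for hotel_sent (values are made up)
toy_sent <- tibble(
  rating          = c(2.9, 7.5, 7.1, 10.0),
  positive        = c(5L, 8L, 4L, 6L),
  negative        = c(13L, 5L, 1L, 0L),
  sentiment_score = positive - negative
)

table1_compact <- toy_sent %>%
  # Reshape to one row per (variable, value) pair
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Mean = mean(value, na.rm = TRUE),
    SD   = sd(value, na.rm = TRUE),
    Min  = min(value, na.rm = TRUE),
    Max  = max(value, na.rm = TRUE),
    .groups = "drop"
  )

table1_compact
```

The trade-off is that rows come out in alphabetical order of the column names, so a final `arrange()` or factor recoding would be needed to restore a custom order.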

library(gt)

## Warning: package 'gt' was built under R version 4.4.3

season_summary <- hotel_sent %>%
  filter(!is.na(season)) %>%
  group_by(season) %>%
  summarise(
    average_rating = round(mean(rating, na.rm = TRUE), 2),
    average_sentiment = round(mean(sentiment_score, na.rm = TRUE), 2),
    review_count = n(),
    .groups = "drop"
  )

season_summary %>%
  gt() %>%
  tab_header(
    title = "Summary of Hotel Reviews by Season",
    subtitle = "Average rating, sentiment score, and review count"
  ) %>%
  cols_label(
    season = "Season",
    average_rating = "Avg Rating",
    average_sentiment = "Avg Sentiment",
    review_count = "Review Count"
  )

Summary of Hotel Reviews by Season
Average rating, sentiment score, and review count
Season	Avg Rating	Avg Sentiment	Review Count
Winter	8.48	2.06	120128
Spring	8.43	2.21	130483
Summer	8.38	2.19	142886
Autumn	8.29	1.92	122241
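To put the seasonal differences in perspective, the sketch below re-enters the values from the seasonal summary table and computes the review-weighted overall averages and the gap between the highest and lowest season. It only restates the printed numbers and says nothing about statistical significance.

```r
# Values copied from the seasonal summary table
season_tbl <- data.frame(
  season            = c("Winter", "Spring", "Summer", "Autumn"),
  average_rating    = c(8.48, 8.43, 8.38, 8.29),
  average_sentiment = c(2.06, 2.21, 2.19, 1.92),
  review_count      = c(120128, 130483, 142886, 122241)
)

# Review-weighted overall means across the four seasons
overall_rating    <- with(season_tbl, weighted.mean(average_rating, review_count))
overall_sentiment <- with(season_tbl, weighted.mean(average_sentiment, review_count))

# Spread between the highest and lowest seasonal average
rating_gap    <- diff(range(season_tbl$average_rating))
sentiment_gap <- diff(range(season_tbl$average_sentiment))

round(c(overall_rating = overall_rating, overall_sentiment = overall_sentiment,
        rating_gap = rating_gap, sentiment_gap = sentiment_gap), 2)
```

The rating gap is 0.19 points on a 10-point scale and the sentiment gap is 0.29, so the seasonal effect is visible but modest in absolute size.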

Table 2: Review Count by Season and Region

table2 <- hotel_sent %>%
  count(season, region) %>%
  pivot_wider(names_from = region, values_from = n, values_fill = 0) %>%
  rename(Season = season)

table2 %>%
  gt() %>%
  tab_header(
    title = "Table 2: Review Count by Season and Region",
    subtitle = "Number of reviews across seasons and European regions"
  ) %>%
  cols_label(
    `Northern Europe` = "Northern",
    `Central Europe` = "Central",
    `Southern Europe` = "Southern"
  ) %>%
  fmt_number(
    columns = where(is.numeric),
    decimals = 0,  # counts are integers; the default of 2 decimals prints ".00"
    use_seps = TRUE
  ) %>%
  tab_options(
    table.align = "center"
  )

Table 2: Review Count by Season and Region
Number of reviews across seasons and European regions
Season	Northern	Central	Southern
Winter	67,346	35,146	17,636
Spring	67,420	38,369	24,694
Summer	66,961	45,112	30,813
Autumn	60,574	37,454	24,213
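The counts in Table 2 also show how strongly review volume itself varies by season in each region. A small sketch, re-entering the printed counts, computes each region's busiest-to-quietest season ratio and each region's share of reviews within a season:

```r
# Counts copied from Table 2 (rows: Winter, Spring, Summer, Autumn)
counts <- cbind(
  Northern = c(67346, 67420, 66961, 60574),
  Central  = c(35146, 38369, 45112, 37454),
  Southern = c(17636, 24694, 30813, 24213)
)
rownames(counts) <- c("Winter", "Spring", "Summer", "Autumn")

# Ratio of the busiest to the quietest season within each region
swing <- apply(counts, 2, function(x) max(x) / min(x))
round(swing, 2)

# Each region's share of reviews within a season (rows sum to 1)
round(prop.table(counts, margin = 1), 3)
```

Southern Europe's summer volume is roughly 1.75 times its winter volume, against about 1.11 for Northern Europe, so seasonal comparisons for the south rest on more uneven sample sizes.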

Individual figures

Figure 1: Monthly Cycle of Traveler Sentiment in Selected Cities

Sentiment Cycle: All four cities exhibit a clear annual sentiment cycle, peaking in spring (March–May) and reaching their lowest point in late autumn (October–November).

Peak Strain (Paris): Paris shows the most dramatic drop, hitting the lowest sentiment score among the group in October. This suggests high visitor stress during the late tourist season.

Relative Stability: Barcelona and Vienna generally maintain the highest mean sentiment scores throughout the year, suggesting comparative performance stability.

top_cities <- c("Vienna", "Paris", "Barcelona", "Amsterdam")

fig1_data <- hotel_sent %>%
  filter(city %in% top_cities) %>%
  group_by(city, month) %>%
  summarise(
    mean_sentiment = mean(sentiment_score, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

figure1 <- ggplot(fig1_data,
                  aes(x = month, y = mean_sentiment,
                      group = city, color = city)) +
  geom_line(linewidth = 1) +  # size is deprecated, updated to linewidth
  geom_point(size = 2) +
  labs(
    title = "Monthly sentiment trends in selected European cities",
    x = "Month",
    y = "Average sentiment score",
    color = "City"
  ) +
  theme(panel.grid.minor = element_blank())

figure1

ggsave(figure1, filename = "images/figure1.png",
width = 6, height = 4, units = "in", bg = "transparent")
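The peak and trough months described for Figure 1 can also be read off programmatically rather than by eye. The sketch below does this on a made-up table, `toy_fig1`, shaped like `fig1_data`; the sentiment values are hypothetical and only illustrate the mechanics.

```r
library(dplyr)

# Toy stand-in for fig1_data (mean_sentiment values are made up)
toy_fig1 <- tibble(
  city  = rep(c("Paris", "Vienna"), each = 4),
  month = factor(rep(c("Jan", "Apr", "Jul", "Oct"), times = 2),
                 levels = c("Jan", "Apr", "Jul", "Oct")),
  mean_sentiment = c(2.0, 2.4, 2.1, 1.6,
                     2.3, 2.6, 2.4, 2.2)
)

peaks_troughs <- toy_fig1 %>%
  group_by(city) %>%
  summarise(
    peak_month   = as.character(month[which.max(mean_sentiment)]),
    trough_month = as.character(month[which.min(mean_sentiment)]),
    amplitude    = max(mean_sentiment) - min(mean_sentiment),  # size of the annual swing
    .groups = "drop"
  )

peaks_troughs
```

Applied to the real `fig1_data`, the `amplitude` column would give a single number per city for comparing how pronounced each cycle is.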

Figure 2: Seasonal Composition of Complaint Types

This chart shows how operational pressure systematically shifts throughout the year across the four selected European cities.

Summer Pressure (Noise): Noise complaints surge significantly in Summer and Autumn, especially in highly touristic cities like Barcelona. This reflects the strain of peak tourist activity.

Winter Pressure (Facilities): Temperature / Facility issues show the inverse pattern, rising sharply in Winter, particularly in cooler cities such as Amsterdam. This points to heating and building maintenance as the primary winter challenge.

Service and Cleanliness complaints remain relatively consistent across all seasons, suggesting these issues are linked more to hotel standards than to external seasonal factors.

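# Tokenized copy with full metadata; not used below, since Figure 2
# re-tokenizes only the four selected cities.
tokens_complaint <- hotel_clean %>%
  select(city, country, region, date, month, season, rating, text) %>%
  unnest_tokens(word, text)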

# 1. Define Dictionary
complaint_dict <- tibble(
  keyword = c("noisy","loud","noise",
              "dirty","smell","stain",
              "rude","unhelpful","slow",
              "cold","hot","ac","aircon","heating"),
  type = c(rep("Noise",3),
           rep("Cleanliness",3),
           rep("Service",3),
           rep("Temperature / Facility",5))
)

# 2. Prepare Data for Plotting

fig2_data_optimized <- hotel_clean %>%
  filter(city %in% top_cities, !is.na(season)) %>%
  select(city, season, text) %>%

  unnest_tokens(word, text) %>%
  mutate(word = str_to_lower(word)) %>%
  inner_join(complaint_dict, by = c("word" = "keyword")) %>%
  
  count(city, season, type) %>% 
  group_by(city, season) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%

  mutate(
    city = factor(city, levels = top_cities), 
    type = fct_relevel(type, "Noise", "Cleanliness", "Service", "Temperature / Facility")
  )

# 3. Plotting (scales is already loaded with the other libraries)

figure2_stacked <- ggplot(fig2_data_optimized, 
       aes(x = season, y = pct, fill = type)) +
  geom_col(position = "stack", width = 0.7) + 
  facet_wrap(~ city, nrow = 1) + 
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_fill_brewer(palette = "Set2") + 
  
  labs(
    title = "Figure 2: Seasonal Composition of Complaint Types",
    subtitle = paste("Analysis of complaint category proportions in:", paste(top_cities, collapse=", ")),
    x = "", 
    y = "Share of Complaints",
    fill = "Complaint Type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank()
  )

figure2_stacked

ggsave(figure2_stacked, filename = "images/figure2_stacked.png",
       width = 10, height = 5, units = "in", bg = "white")
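Because `text` combines positive and negative comments, keywords such as "hot" or "cold" can also match praise (for example, a hot breakfast). One possible refinement, sketched here on made-up reviews, matches the dictionary against the negative comment only and then computes category shares the same way as `fig2_data_optimized`. The column name `negative_text` is hypothetical; in this report it would correspond to `Negative_Review`, which `hotel_clean` does not currently retain.

```r
library(dplyr)
library(tidytext)

# Shortened version of the report's keyword dictionary
complaint_dict <- tibble(
  keyword = c("noisy", "noise", "dirty", "rude", "cold", "hot"),
  type    = c("Noise", "Noise", "Cleanliness", "Service",
              "Temperature / Facility", "Temperature / Facility")
)

# Made-up reviews; positive_text and negative_text are hypothetical columns
toy_reviews <- tibble(
  city   = "Barcelona",
  season = c("Summer", "Summer", "Winter"),
  positive_text = c("Hot breakfast was lovely", "Quiet room", "Great staff"),
  negative_text = c("Street noise all night", "Room was noisy and hot", "Room was cold")
)

complaint_share <- toy_reviews %>%
  select(city, season, negative_text) %>%
  unnest_tokens(word, negative_text) %>%   # lowercases tokens by default
  inner_join(complaint_dict, by = c("word" = "keyword")) %>%
  count(city, season, type) %>%
  group_by(city, season) %>%
  mutate(pct = n / sum(n)) %>%             # share within each city-season
  ungroup()

complaint_share
```

In this toy example the "Hot breakfast" praise is ignored because only the negative text is tokenized, while "noisy and hot" still counts toward both Noise and Temperature / Facility.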

Figure 3: Regional Rating Stability vs. Volatility

This boxplot compares how different European regions handle seasonal pressure.

Northern & Central Resilience: These regions demonstrate high stability. Their rating distributions remain consistent year-round, suggesting they are operationally resilient to seasonal changes.

Southern Volatility: In contrast, Southern Europe exhibits greater variability, particularly in Summer. While the median rating stays high, the distribution’s lower tail drops, indicating a higher likelihood of poor traveler experiences during the peak season.

fig3_data <- hotel_sent %>%
  filter(!is.na(season), region != "Other")

figure3 <- ggplot(fig3_data,
                  aes(x = season, y = rating, fill = region)) +
  geom_boxplot(outlier.alpha = 0.15) +
  scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2)) +  
  scale_fill_brewer(palette = "Set2") + 
  labs(
    title = "Distribution of Hotel Ratings by Season and Region",
    x = "Season",
    y = "Hotel Rating",
    fill = "Region"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position = "bottom"
  )

figure3

ggsave(figure3, filename = "images/figure3.png",
width = 6, height = 4, units = "in", bg = "transparent")
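The stability and volatility read from the boxplot can be backed by numbers. The sketch below, run on a made-up table `toy_fig3` shaped like `fig3_data`, summarises each region-season cell by its median, interquartile range, and 10th percentile, then measures how much those summaries move across seasons. The ratings are hypothetical and chosen only so that one region is stable and the other is not.

```r
library(dplyr)

# Toy stand-in for fig3_data (ratings are made up)
toy_fig3 <- tibble(
  region = rep(c("Northern Europe", "Southern Europe"), each = 8),
  season = rep(rep(c("Winter", "Summer"), each = 4), times = 2),
  rating = c(8.0, 8.5, 9.0, 9.5,   8.0, 8.5, 9.0, 9.5,
             8.0, 8.5, 9.0, 9.5,   5.0, 7.0, 9.0, 10.0)
)

stability <- toy_fig3 %>%
  group_by(region, season) %>%
  summarise(
    median_rating = median(rating),
    iqr_rating    = IQR(rating),
    p10_rating    = quantile(rating, 0.10, names = FALSE),
    .groups = "drop"
  ) %>%
  group_by(region) %>%
  summarise(
    median_range = diff(range(median_rating)),  # seasonal shift in the median
    iqr_range    = diff(range(iqr_rating)),     # seasonal change in spread
    p10_min      = min(p10_rating),             # lowest seasonal 10th percentile
    .groups = "drop"
  )

stability
```

A region with small `median_range` and `iqr_range` would match the "stable" description, while a low `p10_min` flags a season in which the lower tail of ratings drops.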

Conclusion

This analysis indicates that seasonality is a critical, predictable factor in the European hotel experience.

Operational Shifts: The main challenges for hotels shift with the calendar: Summer means dealing with noise, and Winter means fixing facilities and heating.

Regional Strength: Northern and Central Europe demonstrate the highest stability and service resilience throughout the year.

Actionable Insight: Hotels should proactively adjust their focus, for example by improving soundproofing before Summer and performing heating and facility maintenance before Winter, to manage these predictable seasonal pressures.

Seasonal Experience Identity of European Cities

진일가 (20221405)

December 9, 2025