rm(list = ls())
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
options(digits = 4, scipen = 999)

Expedia Hotel Booking Analysis: Does Price Competitiveness Affect Booking Likelihood Across Different Hotel Star Ratings?

1. Research Question

Does price competitiveness affect hotel booking likelihood, and does this effect differ across hotel star ratings?

1.1 Hypotheses

Hypothesis Statement
H1 Hotels with higher price competitiveness (cheaper than same-star average) have higher booking likelihood
H2 The effect of price competitiveness on booking likelihood differs across hotel star ratings

1.2 Definition of Price Competitiveness

In this analysis, price competitiveness is defined as:

Price Ratio=Current Price/Average Price of Same-Star Hotels

  • Price Ratio < 1: Cheaper than same-star average (competitive)

  • Price Ratio = 1: At same-star average price

  • Price Ratio > 1: More expensive than same-star average (uncompetitive)

This definition is more meaningful than comparing to a hotel’s own historical price because customers typically compare hotels within the same category.

2. Data Cleaning

# Load data
df_raw <- read_csv("data/expedia.csv")
Rows: 158269 Columns: 54
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (31): visitor_hist_starrating, visitor_hist_adr_usd, prop_review_score,...
dbl  (22): srch_id, site_id, visitor_location_country_id, prop_country_id, p...
date  (1): date_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Data cleaning
df_clean <- df_raw %>%
  mutate(
    # Dependent variable
    booking = booking_bool,
    
    # Star rating categories
    star_category = case_when(
      prop_starrating <= 2 ~ "Budget (1-2★)",
      prop_starrating == 3 ~ "Mid (3★)",
      prop_starrating >= 4 ~ "Luxury (4-5★)",
      TRUE ~ "Unknown"
    ),
    
    # Handle missing values
    visitor_hist_starrating = ifelse(is.na(visitor_hist_starrating), 0, visitor_hist_starrating),
    visitor_hist_adr_usd = ifelse(is.na(visitor_hist_adr_usd), 0, visitor_hist_adr_usd),
    prop_review_score = ifelse(is.na(prop_review_score), 0, prop_review_score),
    prop_location_score1 = ifelse(is.na(prop_location_score1), 0, prop_location_score1)
  ) %>%
  filter(prop_starrating > 0) %>%
  filter(!is.na(position)) %>%
  filter(!is.na(price_usd))

# Calculate price competitiveness by star rating
df_clean <- df_clean %>%
  group_by(prop_starrating) %>%
  mutate(
    avg_price_by_star = mean(price_usd, na.rm = TRUE),
    price_ratio = price_usd / avg_price_by_star
  ) %>%
  ungroup() %>%
  filter(price_ratio > 0.3 & price_ratio < 3.0)

# Create price competitiveness groups
df_clean <- df_clean %>%
  mutate(
    price_comp_group = case_when(
      price_ratio < 0.7 ~ "Very Competitive",
      price_ratio >= 0.7 & price_ratio < 0.9 ~ "Competitive",
      price_ratio >= 0.9 & price_ratio < 1.1 ~ "At Market Price",
      price_ratio >= 1.1 & price_ratio < 1.3 ~ "Uncompetitive",
      price_ratio >= 1.3 ~ "Very Uncompetitive",
      TRUE ~ "Unknown"
    )
  )

cat("Cleaned data rows:", nrow(df_clean))
Cleaned data rows: 150885

3. Descriptive Analysis

3.1 Booking Rate by Star Rating

df_clean %>%
  group_by(star_category) %>%
  summarise(
    Count = n(),
    Booking_Rate = paste0(round(mean(booking) * 100, 1), "%"),
    Avg_Price_Ratio = round(mean(price_ratio, na.rm = TRUE), 2)
  ) %>%
  knitr::kable(caption = "Table 1: Booking Rate by Hotel Star Rating")
Table 1: Booking Rate by Hotel Star Rating
star_category Count Booking_Rate Avg_Price_Ratio
Budget (1-2★) 41800 2.5% 0.91
Luxury (4-5★) 40116 3.2% 0.98
Mid (3★) 68969 3% 0.97

Key observation: Luxury hotels have the highest booking rate (3.2%), while Budget hotels have the lowest (2.5%). All star categories have average price ratios below 1, indicating that Expedia generally displays competitive prices.

3.2 Booking Rate by Price Competitiveness

df_clean %>%
  group_by(price_comp_group) %>%
  summarise(
    Count = n(),
    Booking_Rate = paste0(round(mean(booking) * 100, 1), "%"),
    Avg_Star_Rating = round(mean(prop_starrating, na.rm = TRUE), 1)
  ) %>%
  arrange(desc(Booking_Rate)) %>%
  knitr::kable(caption = "Table 2: Booking Rate by Price Competitiveness")
Table 2: Booking Rate by Price Competitiveness
price_comp_group Count Booking_Rate Avg_Star_Rating
Very Competitive 44024 3.9% 3.0
Competitive 37775 3.1% 3.0
At Market Price 26913 2.8% 3.0
Uncompetitive 16372 2.1% 3.1
Very Uncompetitive 25801 1.5% 3.1

Key observation: Very Competitive hotels have a booking rate of 3.9%, which is 2.6 times higher than Very Uncompetitive hotels (1.5%). This strongly supports H1.

3.3 Cross-Tabulation: Star Rating × Price Competitiveness

cross_table <- df_clean %>%
  group_by(star_category, price_comp_group) %>%
  summarise(
    Booking_Rate = round(mean(booking) * 100, 1),
    .groups = "drop"
  ) %>%
  pivot_wider(
    id_cols = star_category,
    names_from = price_comp_group,
    values_from = Booking_Rate,
    values_fill = NA
  )

knitr::kable(cross_table, caption = "Table 3: Booking Rate (%) by Star Rating and Price Competitiveness")
Table 3: Booking Rate (%) by Star Rating and Price Competitiveness
star_category At Market Price Competitive Uncompetitive Very Competitive Very Uncompetitive
Budget (1-2★) 2.7 2.5 2.1 2.8 1.7
Luxury (4-5★) 2.9 3.4 2.3 4.9 1.3
Mid (3★) 2.8 3.2 1.9 4.3 1.6

Key observation: Within each star category, booking rate increases as price competitiveness improves. However, the magnitude varies by star rating.

3.4 Price Sensitivity Analysis

sensitivity_table <- df_clean %>%
  group_by(star_category) %>%
  summarise(
    Booking_Rate_Cheaper = round(mean(booking[price_ratio < 0.9]) * 100, 1),
    Booking_Rate_Expensive = round(mean(booking[price_ratio > 1.1]) * 100, 1),
    Difference = Booking_Rate_Cheaper - Booking_Rate_Expensive,
    .groups = "drop"
  )

knitr::kable(sensitivity_table, caption = "Table 4: Price Sensitivity by Star Rating")
Table 4: Price Sensitivity by Star Rating
star_category Booking_Rate_Cheaper Booking_Rate_Expensive Difference
Budget (1-2★) 2.6 1.8 0.8
Luxury (4-5★) 4.3 1.7 2.6
Mid (3★) 3.7 1.8 1.9

Key observation: Price sensitivity ranking: Luxury (+2.6 pp) > Mid (+1.9 pp) > Budget (+0.8 pp). This contradicts the intuition that budget travelers are most price-sensitive.

4. Visualization

4.1 Price Competitiveness vs Booking Rate by Star Rating

plot1_data <- df_clean %>%
  filter(star_category != "Unknown") %>%
  mutate(price_bin = cut(price_ratio, 
                         breaks = seq(0.4, 2.2, 0.2),
                         labels = c("0.4-0.6", "0.6-0.8", "0.8-1.0", 
                                    "1.0-1.2", "1.2-1.4", "1.4-1.6", 
                                    "1.6-1.8", "1.8-2.0", "2.0-2.2"))) %>%
  group_by(star_category, price_bin) %>%
  summarise(
    Booking_Rate = mean(booking) * 100,
    n = n(),
    .groups = "drop"
  ) %>%
  filter(n >= 30)

ggplot(plot1_data, 
       aes(x = price_bin, y = Booking_Rate, 
           color = star_category, group = star_category)) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5) +
  geom_hline(yintercept = mean(df_clean$booking) * 100, 
             linetype = "dashed", color = "gray50") +
  labs(
    title = "Price Competitiveness vs Booking Rate by Hotel Star Rating",
    subtitle = "Price ratio < 1 indicates cheaper than same-star average",
    x = "Price Ratio (Current Price / Same-Star Average)",
    y = "Booking Rate (%)",
    color = "Star Category"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5)
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Figure 1 shows that for all star categories, booking rates decrease as prices become less competitive. The effect is strongest for Luxury hotels.

4.2 Boxplot: Booked vs Not Booked

ggplot(df_clean, aes(x = ifelse(booking == 1, "Booked", "Not Booked"), 
                     y = price_ratio, fill = star_category)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.5) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "darkred", size = 1) +
  labs(
    title = "Price Competitiveness Distribution: Booked vs Not Booked",
    subtitle = "Red dashed line = same-star average price (ratio = 1)",
    x = "Booking Outcome",
    y = "Price Ratio (Lower = More Competitive)",
    fill = "Star Category"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5))

Figure 2 shows that booked hotels have significantly lower price ratios (more competitive) than non-booked hotels.

4.3 Price Sensitivity by Star Rating

sensitivity_table %>%
  filter(star_category != "Unknown") %>%
  ggplot(aes(x = star_category, y = Difference, fill = star_category)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  geom_text(aes(label = paste0(Difference, " pp")), vjust = -0.5, size = 4) +
  labs(
    title = "Price Sensitivity by Hotel Star Rating",
    subtitle = "Difference = Booking rate (cheaper) - Booking rate (expensive)",
    x = "Star Category",
    y = "Price Sensitivity (percentage points)"
  ) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5))

Figure 3 clearly shows that Luxury hotels have the highest price sensitivity, followed by Mid hotels, with Budget hotels being the least price-sensitive.

5. Conclusion

5.1 Summary of Findings

Finding Evidence Supports
More competitive pricing increases booking likelihood Table 2, Figure 1, Figure 2 H1: Yes
Price sensitivity differs across star ratings Table 4, Figure 3 H2: Yes
Luxury hotels are most price-sensitive Table 4, Figure 3 Surprising finding
Budget hotels are least price-sensitive Table 4, Figure 3 Surprising finding

5.2 Answer to Research Questions

Q1: Does price competitiveness affect hotel booking likelihood?

Yes. Hotels with more competitive pricing (lower price ratio) have significantly higher booking probabilities. Very Competitive hotels have a 3.9% booking rate, compared to only 1.5% for Very Uncompetitive hotels — a 2.6x difference.

Q2: Does this effect differ across hotel star ratings?

Yes. The price sensitivity ranking is:

Rank Star Category Price Sensitivity
1 Luxury (4-5★) +2.6 pp
2 Mid (3★) +1.9 pp
3 Budget (1-2★) +0.8 pp

5.3 Possible Explanation for Counter-Intuitive Finding

Why are Luxury hotels most price-sensitive?

Factor Explanation
Absolute savings A 10% discount on a 500roomsaves500roomsaves50, while the same discount on a 100roomsavesonly100roomsavesonly10
More alternatives Luxury travelers have more comparable options to choose from
Cross-platform comparison Higher-end travelers are more likely to compare prices across platforms

5.4 Business Recommendations for Expedia

  1. Luxury hotels: Price promotions are most effective. Prioritize displaying luxury hotels with competitive prices.

  2. Mid hotels: Price competitiveness is key for conversion. Maintain competitive pricing for this segment.

  3. Budget hotels: Price beyond a certain point has diminishing returns. Focus on other factors like reviews and location.

6. Limitations

  1. Price competitiveness definition: Uses same-star average as benchmark; ideal measure would include competitor prices

  2. Correlation vs causation: Observational data cannot prove causality

  3. Sample period: Data may not reflect current post-pandemic travel behavior

7. References

  • Expedia ICDM 2013 Dataset Documentation