The analysis of this project is based on a dataset obtained from Kaggle (https://www.kaggle.com/datasets/mojtaba142/hotel-booking/data). It’s a dataset of 119390 hotel booking records. The following is a snapshot of the data.
## # A tibble: 6 × 38
## hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
## <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Resort… 0 342 2015 July 27 1 0 0 2
## 2 Resort… 0 737 2015 July 27 1 0 0 2
## 3 Resort… 0 7 2015 July 27 1 0 1 1
## 4 Resort… 0 13 2015 July 27 1 0 1 1
## 5 Resort… 0 14 2015 July 27 1 0 2 2
## 6 Resort… 0 14 2015 July 27 1 0 2 2
## # … with 28 more variables: children <dbl>, babies <dbl>, meal <fct>,
## # country <fct>, market_segment <fct>, distribution_channel <fct>,
## # is_repeated_guest <dbl>, previous_cancellations <dbl>,
## # previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
## # assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
## # agent <dbl>, company <fct>, days_in_waiting_list <dbl>,
## # customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>, …
In travel industry, it’s very important to understand where the customers come from, their booking behavior, through which travel agencies they book the hotel, how early do they book the hotel etc.. Business can adjust their business strategies accordingly. For example, their inventory strategy, pricing strategy and promotion strategy. The analysis below strives to slice and dice data from 7 different angles to discover the areas of interest and deep dive in select topics.
The first chart renders the booking trend by arrival date. Confirmed bookings and cancelled bookings are separated so that the audience can clearly see the trend. Since both confirmed bookings and cancelled bookings could have business implications (e.g., inventory, cancellation policy etc.), it’s important to monitor the trend regularly and make necessary business adjustments.
We can see that
As we can see in the first chart, booking cancellation is pretty significant across the year and it varies a lot. As a business, we want to understand which market segment caused the most cancellations so that we could revisit GTM strategies as well as pricing strategies. We can easily observe from the chart below that Online TA (Travel Agency)’s cancellation proportion more than doubled in two years timeframe (from 29% in July 2015 all the way to 82% in August 2017). Further analysis needs to be done against this market segement to find out the cause.
The chart below depicts which Distribution Channel/Market Segment combinations have the highest average daily price (over two years time frame). As we can see, the following combinations are hero
Starting from 2016, Aviation has becoming a strong performing market segment
You can interact with the chart below by zoom in/out or mouse over the dots. Give it a try.
Price is not the only KPI, but volume as well. We want to know which market segment/distribution channel excel on both fronts. In the diagram below, segment/channel combination is ranked by price on the left, volume on the right. You can tell that even though some pairs’ price is ideal, but their transaction volume is quite low (e.g., Offline TA/TO | GDS). Some pair ranks pretty high in both measures (e.g., Direct|Direct). The third category would be fair on price but leading in volume (Online TA | TA/TO ). Based on this finding, business can optimize channel strategy and give more emphasis to leading pairs.
You can interact with the chart below by zoom in/out or mouse over the dots. Give it a try.
From business’s point of view, maximizing revenue is a goal. Since hotel inventory is fairly fixed, how to give more inventory to those performing companies is a key business question. The first chart below is a scatter plot of # of booking and average daily price. We can see that the variation is quite big. We form the hypothesis that a good performing company should satisfy the following criteria:
We use the criteria to single out some companies, highlighted by red circle in the chart below.
Using the criteria, we picked ten companies. The chart below depicts their average adr over time. It’s quite obvious that most companies’ adr fluctuate over time. However, company #40’s performance is very stable. Based on the finding, the business can come up with a strategy to maximize the revenue by giving different companies different inventory or different incentives.
From business’s perpective, it’s also interesting to find out if there’s any correlation among different parameters, even though correlation doesn’t imply causation. The chart below depicts the correlation among all numeric fields in the booking data. As you can tell, there’s no strong correlation (correlation coefficient greater than 70%) between any two variables. The only visible positive correlation is between weekday stays and weekend stays. While arrival week number seems to have noticable negative correlation with arrival year. However, it was caused by the starting and end point of the dataset itself, thus no business significance.
We want to know how advanced do people book their hotel prior to their arrival. The first chart shows the average lead time (the difference between booking date and actual arrival date, for confirmed booking only) comparison between global and top 6 countries (in terms of # of bookings).
** Observations:**
The spread of lead time varies a lot country by country. Again, you can tell from the chart below that lead time for people with Germany and Britain origin varies a lot on any given month, comparing with that of other countries. Peole from Spain tends to have shorter lead time and narrower lead time span, indicating that they tend to have more ad hoc travels. Business can create different marketing strategies based on this finding to change people’s behavior based on thier origin.
booking<-booking %>% mutate(assign_match_reserve = ifelse(as.character(reserved_room_type)==as.character(assigned_room_type), "Yes", "No"), room_change_direction = ifelse(as.character(reserved_room_type)!=as.character(assigned_room_type), paste0(reserved_room_type," to ",assigned_room_type), NA))
room_change_stat<-booking %>% filter(!is.na(room_change_direction)) %>% group_by(hotel) %>% count(reserved_room_type,assigned_room_type) %>% mutate(pct = n/sum(n))
roommatch_by_month<-booking %>% group_by(arrival_month, hotel) %>% summarize(total = n(), room_match = sum(assign_match_reserve == "Yes"), room_change = sum(assign_match_reserve == "No")) %>% mutate (pct_match = room_match/total, pct_change = room_change/total)
## `summarise()` has grouped output by 'arrival_month'. You can override using the
## `.groups` argument.
ggplot(roommatch_by_month, aes(arrival_month, pct_match, color = hotel)) +
geom_line()
ggplot(room_change_stat , aes(reserved_room_type,assigned_room_type)) +
geom_raster(aes(fill = pct)) +
scale_fill_continuous(high = "darkred", low = "grey") +
facet_wrap(~hotel)
The analysis above touched on trending analysis, behavior analysis and performance analysis. More in-depth analysis could be done towards finding out more insights. For example, if the dataset starts in January and ends in December in all three years, then accumulated booking can be compared across years to get a sense of the economy. If in certain year travel industry gets hit by whatever factor, the accumulated booking by months can be different, even though the month by month pattern is the same. Those more advanced analysis is outside the scope of the project.