Data Visualization Capstone Project

Introduction

The analysis of this project is based on a dataset obtained from Kaggle (https://www.kaggle.com/datasets/mojtaba142/hotel-booking/data). It’s a dataset of 119390 hotel booking records. The following is a snapshot of the data.

## # A tibble: 6 × 38
##   hotel   is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
##   <fct>     <dbl>   <dbl>   <dbl> <fct>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Resort…       0     342    2015 July         27       1       0       0      2
## 2 Resort…       0     737    2015 July         27       1       0       0      2
## 3 Resort…       0       7    2015 July         27       1       0       1      1
## 4 Resort…       0      13    2015 July         27       1       0       1      1
## 5 Resort…       0      14    2015 July         27       1       0       2      2
## 6 Resort…       0      14    2015 July         27       1       0       2      2
## # … with 28 more variables: children <dbl>, babies <dbl>, meal <fct>,
## #   country <fct>, market_segment <fct>, distribution_channel <fct>,
## #   is_repeated_guest <dbl>, previous_cancellations <dbl>,
## #   previous_bookings_not_canceled <dbl>, reserved_room_type <fct>,
## #   assigned_room_type <fct>, booking_changes <dbl>, deposit_type <fct>,
## #   agent <dbl>, company <fct>, days_in_waiting_list <dbl>,
## #   customer_type <fct>, adr <dbl>, required_car_parking_spaces <dbl>, …

In travel industry, it’s very important to understand where the customers come from, their booking behavior, through which travel agencies they book the hotel, how early do they book the hotel etc.. Business can adjust their business strategies accordingly. For example, their inventory strategy, pricing strategy and promotion strategy. The analysis below strives to slice and dice data from 7 different angles to discover the areas of interest and deep dive in select topics.

What’s the trend of booking?
Who cancelled the booking?
Which distribution channel works the best for which market segment?
Market sgement/distribution channel: take volume into consideration
Which agent companies are top performers?
Is there any correlation among different factors of booking?
How early do people book their hotel in advance?

1. Understand the basics

The first chart renders the booking trend by arrival date. Confirmed bookings and cancelled bookings are separated so that the audience can clearly see the trend. Since both confirmed bookings and cancelled bookings could have business implications (e.g., inventory, cancellation policy etc.), it’s important to monitor the trend regularly and make necessary business adjustments.

We can see that

Booking has seasonal patterns as we could expect
Though booking reached the highest level in Oct 2016 and May 2017, so does cancelled bookings. Confirmed bookings kept flat compartively.

2. Zoom into cancelled bookings

As we can see in the first chart, booking cancellation is pretty significant across the year and it varies a lot. As a business, we want to understand which market segment caused the most cancellations so that we could revisit GTM strategies as well as pricing strategies. We can easily observe from the chart below that Online TA (Travel Agency)’s cancellation proportion more than doubled in two years timeframe (from 29% in July 2015 all the way to 82% in August 2017). Further analysis needs to be done against this market segement to find out the cause.

3. Which distribution channel works the best for which market segment

The chart below depicts which Distribution Channel/Market Segment combinations have the highest average daily price (over two years time frame). As we can see, the following combinations are hero

Online TA | GDS
Online TA | TA/TO
Online TA | Corporate
Direct | Direct
Direct | TA/TO

Starting from 2016, Aviation has becoming a strong performing market segment

You can interact with the chart below by zoom in/out or mouse over the dots. Give it a try.

4. Market sgement/distribution channel pairs performance ranking (volume wise and dollar amount wise)

Price is not the only KPI, but volume as well. We want to know which market segment/distribution channel excel on both fronts. In the diagram below, segment/channel combination is ranked by price on the left, volume on the right. You can tell that even though some pairs’ price is ideal, but their transaction volume is quite low (e.g., Offline TA/TO | GDS). Some pair ranks pretty high in both measures (e.g., Direct|Direct). The third category would be fair on price but leading in volume (Online TA | TA/TO ). Based on this finding, business can optimize channel strategy and give more emphasis to leading pairs.

You can interact with the chart below by zoom in/out or mouse over the dots. Give it a try.

5. Which agent companies are top performers?

From business’s point of view, maximizing revenue is a goal. Since hotel inventory is fairly fixed, how to give more inventory to those performing companies is a key business question. The first chart below is a scatter plot of # of booking and average daily price. We can see that the variation is quite big. We form the hypothesis that a good performing company should satisfy the following criteria:

Have decent amount of booking
Average daily rate is higher than a threshold
Have been actively selling inventory for a period of time
Average daily rate deviation is smaller than a threshold

We use the criteria to single out some companies, highlighted by red circle in the chart below.

Using the criteria, we picked ten companies. The chart below depicts their average adr over time. It’s quite obvious that most companies’ adr fluctuate over time. However, company #40’s performance is very stable. Based on the finding, the business can come up with a strategy to maximize the revenue by giving different companies different inventory or different incentives.

6. Discover the correlation among multiple parameters

From business’s perpective, it’s also interesting to find out if there’s any correlation among different parameters, even though correlation doesn’t imply causation. The chart below depicts the correlation among all numeric fields in the booking data. As you can tell, there’s no strong correlation (correlation coefficient greater than 70%) between any two variables. The only visible positive correlation is between weekday stays and weekend stays. While arrival week number seems to have noticable negative correlation with arrival year. However, it was caused by the starting and end point of the dataset itself, thus no business significance.

7. Lead time Analysis

We want to know how advanced do people book their hotel prior to their arrival. The first chart shows the average lead time (the difference between booking date and actual arrival date, for confirmed booking only) comparison between global and top 6 countries (in terms of # of bookings).

** Observations:**

People from Britain and Germany tends to plan much earlier than people from other countries
Lead time in summer is higher than other seasons
Lead time in December his higher due to Christmas travel

The spread of lead time varies a lot country by country. Again, you can tell from the chart below that lead time for people with Germany and Britain origin varies a lot on any given month, comparing with that of other countries. Peole from Spain tends to have shorter lead time and narrower lead time span, indicating that they tend to have more ad hoc travels. Business can create different marketing strategies based on this finding to change people’s behavior based on thier origin.

booking<-booking %>% mutate(assign_match_reserve = ifelse(as.character(reserved_room_type)==as.character(assigned_room_type), "Yes", "No"), room_change_direction = ifelse(as.character(reserved_room_type)!=as.character(assigned_room_type), paste0(reserved_room_type," to ",assigned_room_type), NA))
room_change_stat<-booking %>% filter(!is.na(room_change_direction)) %>% group_by(hotel) %>% count(reserved_room_type,assigned_room_type) %>% mutate(pct = n/sum(n))
roommatch_by_month<-booking %>%  group_by(arrival_month, hotel) %>% summarize(total = n(), room_match = sum(assign_match_reserve == "Yes"), room_change = sum(assign_match_reserve == "No")) %>% mutate (pct_match = room_match/total, pct_change = room_change/total)

## `summarise()` has grouped output by 'arrival_month'. You can override using the
## `.groups` argument.

ggplot(roommatch_by_month, aes(arrival_month, pct_match, color = hotel)) +
  geom_line()

ggplot(room_change_stat , aes(reserved_room_type,assigned_room_type)) +
  geom_raster(aes(fill = pct)) +
  scale_fill_continuous(high = "darkred", low = "grey") +
  facet_wrap(~hotel)

Summary

The analysis above touched on trending analysis, behavior analysis and performance analysis. More in-depth analysis could be done towards finding out more insights. For example, if the dataset starts in January and ends in December in all three years, then accumulated booking can be compared across years to get a sense of the economy. If in certain year travel industry gets hit by whatever factor, the accumulated booking by months can be different, even though the month by month pattern is the same. Those more advanced analysis is outside the scope of the project.