Members : Soumi Giri, Arvind Kurumbal Radhakrishnan, Samreen Pathan, Saksham Bhutani, Divyanshu Kumar

Hotel Data Project

1. Introduction

1.1 Problem Statement

Tourism has always being a very lucrative industry worth millions of dollars in US alone. Due to the pandemic in 2020 the tourism industry was one of the worst impacted sectors. But as the world is moving towards a new normal we are taking a look at what strategies can be devised to improve booking rates for hotels as well as improve the hotel revenues.

To infer the different marketing and tactical strategies to improve hotel bookings we are taking a look at the hotels data provided to us. The data gives us details of hotel bookings from 1st of July of 2015 to the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. We have 2 hotel types from the same chain in our data Resort Hotel and City Hotel for which we have details regarding bookings like arrival dates, family members, type of room, booking medium etc. Both the hotels are located in Portugal, The Resort Hotel in resort region in Algarve and the City Hotel in Lisbon. Using the data we are trying to identify actionable strategies which would help us increase the booking rates of the hotels while also trying to identify the customer preferences.

1.2 Approach

We have a host of information at our disposal including arrival date information, duration of stay, customer demographics (number of adults/children), booking channel information (travel agents/tour operators details), customer historic behavior (booking changes/cancellations) etc. These features will be analyzed against number of bookings/Average Daily Rate/number of cancellations. This data exploration exercise should help:

  • A person with little context of the field gain insights on trends in the hotel industry
  • Serve as good starting point for someone looking to enter the industry
  • Help existing players modify their market strategies

1.3 Analytical Steps

Following were the steps taken:

  1. Identify features: Analyze all features and identify key features on the basis of the insights to be generated.
  2. Clean data: Diagnose and treat any problems in the data on a per feature basis.
  3. Exploratory Data Analysis:
    • Summarize: Understand distribution of variables, use summary statistics like mean, median etc.
    • Visualize: Use visualizations to better understand the structure and distributions across variables.
  4. Generate hypothesis: On the basis of EDA formulate hypothesis using key features
  5. Test hypothesis: Test generated hypothesis to check validity of claims
  6. Conclusion: Infer result of hypothesis test and generate insights in accordance with the results of the testing

1.4 Benefits of the Analysis

This analysis will give the management team of the Portugal based hotel chain, a good understanding of bookings and cancellation trends over time and across other variables like customer demographics and services provided by the hotel. The insights can also be considered as a generalized trend for the industry across similar geographical regions. In addition to gaining an understanding of prevalent trends, this analysis can also be the foundation for design of marketing and operational strategies.

2. Packages Required

2.1 Packages Used in Analysis

The following packages are loaded to conduct our analysis

library(tidyverse)
library(ggplot2)
library(dplyr)
library(knitr)
library(DT)
library(plotly)
library(viridis)
library(wordcloud)
library(RColorBrewer)
library(forcats)

2.2 Package Description

The following packages are used for our analysis:

  1. dplyr - dplyr is a grammar of data manipulation, providing a consistent set of verbs that helps us solve the most common data manipulation challenges.
  2. ggplot2 - ggplot2 is a system for declaratively creating graphics, to help us visualize our data. We provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  3. knitr - This is an alternative tool to Sweave with a more flexible design and new features like caching and finer control of graphics. This help us incorporate more dynamic graphic representation (example: table views) for our report.
  4. DT - R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables.
  5. plotly - Plotly is an R package for creating interactive web-based graphs via plotly’s JavaScript graphing library, plotly.js
  6. viridis - Provides color scales to make plots that are pretty, better represent data, easier to read by those with colorblindness, and prints well in gray scale.
  7. wordcloud - Generates wordcloud in R
  8. RColorBrewer - RColorBrewer is an R packages that uses the work from http://colorbrewer2.org/ to help you choose sensible colour schemes for figures in R. The colors are split into three group, sequential, diverging, and qualitative.
  9. forcats - The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include: fct_reorder() : Reordering a factor by another variable.

3. Data Preparation

3.1 Source of Data

Data used for our analysis can be found here.

3.2 Data Description

The data format is mixed (raw and processed).The data is comprised of two types of hotels i.e. Resort Hotel (40,060 obs) located in Algarve and City Hotel (79,330 obs) located in Lisbon. The data was collected between July 1st, 2015 and August 31st, 2017. Since the data is real, hotel name and any customer identifiers are not included.The data has 32 variables which can be classified into the following categories:

  1. Time Based Features - Gives us when a booking was made, when the booking party arrived and how long the stay was. ArrivalDateDayOfMonth, ArrivalDateMonth, ArrivalDateWeekNumber, ArrivalDateYear, LeadTime etc

  2. Customer Demography Features - Characteristics of the customers who had made the booking. Adults, Babies, Children, Country etc

  3. Booking Profile - These columns provide us a specific characteristics of a customer’s booking. AssignedRoomType, BookingChanges, DaysInWaitingList, IsRepeatedGuest etc

  4. Market Channels - Different channels through which the reservation was made. Agent, Company, DistributionChannel, MarketSegment.

  5. Hotel Services - Different additional services that the hotel provides. meal, required_car_parking_spaces, total_of_special_requests.

The data was collected for educational purposes and is focusing on revenue management. Data was obtained directly from the hotels’ PMS databases’ servers.The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.

The objective of the data is to give a better understanding of this market as multiple things can be explored by this data like customer segmentation, seasonality, cancellation prediction, etc.

A detailed data dictionary view - Link.

3.3 Data Importing and Cleaning

hotels <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')

3.3.1 Understanding Data Structure

In our hotels data, we have 13 character variables, 18 numeric variables and 1 date variable.

str(hotels)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Below we observe the summary statistics for all ournumerics columns.

num_cols <- unlist(lapply(hotels, is.numeric))

summary(hotels[, num_cols])
##   is_canceled       lead_time   arrival_date_year arrival_date_week_number
##  Min.   :0.0000   Min.   :  0   Min.   :2015      Min.   : 1.00           
##  1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016      1st Qu.:16.00           
##  Median :0.0000   Median : 69   Median :2016      Median :28.00           
##  Mean   :0.3704   Mean   :104   Mean   :2016      Mean   :27.17           
##  3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017      3rd Qu.:38.00           
##  Max.   :1.0000   Max.   :737   Max.   :2017      Max.   :53.00           
##                                                                           
##  arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
##  Min.   : 1.0              Min.   : 0.0000         Min.   : 0.0        
##  1st Qu.: 8.0              1st Qu.: 0.0000         1st Qu.: 1.0        
##  Median :16.0              Median : 1.0000         Median : 2.0        
##  Mean   :15.8              Mean   : 0.9276         Mean   : 2.5        
##  3rd Qu.:23.0              3rd Qu.: 2.0000         3rd Qu.: 3.0        
##  Max.   :31.0              Max.   :19.0000         Max.   :50.0        
##                                                                        
##      adults          children           babies          is_repeated_guest
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000   Min.   :0.00000  
##  1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000   1st Qu.:0.00000  
##  Median : 2.000   Median : 0.0000   Median : 0.000000   Median :0.00000  
##  Mean   : 1.856   Mean   : 0.1039   Mean   : 0.007949   Mean   :0.03191  
##  3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000   3rd Qu.:0.00000  
##  Max.   :55.000   Max.   :10.0000   Max.   :10.000000   Max.   :1.00000  
##                   NA's   :4                                              
##  previous_cancellations previous_bookings_not_canceled booking_changes  
##  Min.   : 0.00000       Min.   : 0.0000                Min.   : 0.0000  
##  1st Qu.: 0.00000       1st Qu.: 0.0000                1st Qu.: 0.0000  
##  Median : 0.00000       Median : 0.0000                Median : 0.0000  
##  Mean   : 0.08712       Mean   : 0.1371                Mean   : 0.2211  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                3rd Qu.: 0.0000  
##  Max.   :26.00000       Max.   :72.0000                Max.   :21.0000  
##                                                                         
##  days_in_waiting_list      adr          required_car_parking_spaces
##  Min.   :  0.000      Min.   :  -6.38   Min.   :0.00000            
##  1st Qu.:  0.000      1st Qu.:  69.29   1st Qu.:0.00000            
##  Median :  0.000      Median :  94.58   Median :0.00000            
##  Mean   :  2.321      Mean   : 101.83   Mean   :0.06252            
##  3rd Qu.:  0.000      3rd Qu.: 126.00   3rd Qu.:0.00000            
##  Max.   :391.000      Max.   :5400.00   Max.   :8.00000            
##                                                                    
##  total_of_special_requests
##  Min.   :0.0000           
##  1st Qu.:0.0000           
##  Median :0.0000           
##  Mean   :0.5714           
##  3rd Qu.:1.0000           
##  Max.   :5.0000           
## 

3.3.2 Understanding Data Distribution

Days in waiting list

We observe that 96% bookings were confirmed within a day.

Booking Changes

We observe that for Booking Changes, 95% of the values lie between 0 and 2, 99% between 0 and 4. There are only 267 (0.2%) such cases where there are more than 4 booking changes.

Lead Time

We observe that 3148 bookings in our hotels data has Lead Time of more than a year. From the below data we also see that Resort Hotel has lower average lead time than City Hotel.

hist(hotels$days_in_waiting_list,
     main="Distribution of days in waiting list",
     xlab = "Days in waiting list",
     col = "#00BFC4")
hist(hotels$booking_changes,
     main="Distribution of Booking changes",
     xlab = "Booking changes",
     ylim=c(0,80000),
     col = "#00BFC4")
hist(hotels$lead_time,
     main="Distribution of Lead time",
     xlab = "Lead time",
     col = "#00BFC4")

Previous cancellations

We observe that for Previous cancellations, 99% of the values lie between 0 and 1. There are only 317 such cases where there are more than 2 previous cancellations.

Previous bookings not canceled

We observe that for Previous bookings not canceled, 99% of the values lie between 0 and 7. There are only 552 such cases where there are more than 7 non-cancellations.

#Understand variable and visually inspect
hist(hotels$previous_cancellations,
     main="Distribution of Previous Cancellation",
     xlab = "Previous Cancellations",
     col = "#00BFC4")
hist(hotels$previous_bookings_not_canceled,
     main="Distribution of Non Cancelled Bookings",
     xlab = "Previous non cancelled bookings",
     col = "#00BFC4")

3.3.3 NULL/NA Identification and Treatment

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

As mentioned earlier for the agent and company columns, the string “NULL” indicates that the booking wasn’t made through an agent/company. Below are the summary numbers for such bookings.

length(hotels$agent[hotels$agent == "NULL"])
## [1] 16340
length(hotels$company[hotels$company == "NULL"])
## [1] 112593

Agent column has 16340 (~13%) NULLs and Company has 112593 (~94%) NULLs. This indicates that these bookings were not made via any agent or company.

Rest of the columns do not have any NULLs or missing values.

3.3.4 Checking for Outliers

We check for outliers for the following numeric variables

ADR

We observe that for ADR, all values lie within 550 except one which is way greater, 5400. We are going with an assumption that this is a data entry issue and treating it as an outlier, replacing it by 0.

boxplot(hotels$adr)

After treating the outlier the distribution looks like:

hotels$adr[hotels$adr=="5400"] <- 0
hist(hotels$adr,
     main="Distribution of ADR",
     xlab = "ADR",
     col = "#00BFC4", breaks = 50)

3.4 Final Dataset

datatable(hotels[1:100,], caption = 'Table 1: Clean and tidy data.')

3.5 Data Summary

4. Exploratory Data Analysis

From the 32 features present in our data we can broadly classify the features in the following categories:

  • Time based features - Gives us when a booking was made, when the booking party arrived and how long the stay was
  • Customer Demography features - Characteristics of the customers who had made the booking. These consists of customer country, family size, etc.
  • Booking Profile features - These columns provide us a specific characteristics of a customer booking entry. Example - booking duration. stay in weekends, stay in week night, etc.
  • Market Channel features - Different channels through which the reservation was made. Example - online, corporate, etc.
  • Hotel Services - Different additional services that the hotel provides. Example - meals, car parking, etc.

We will identify insights and strategies on these 4 categories to understand the factors that influence a hotel booking and cancellation.

4.1 Time Based Features

4.1.1 Hypothesis Proposed

We have information regarding the lead time between booking and arrival, arrival date, month, year, duration of stay in the data. Leveraging these, the below hypothesis will be checked:

  • Relation between lead time and cancellation, i.e. are bookings that were made a long time back canceled more often or are recently made bookings canceled more often ?

  • Relation between lead time and arrival month, i.e. for which months do guests book a long time in advance ?

  • Which months/weeks have more bookings ?

  • Which days of each month have more bookings ? Are there certain days in certain months where bookings spike, for example the Dec 25 - Jan 1st week? This combined with the previous analysis can be used for dynamic pricing.

On initial exploration of number of bookings across years, we see a big jump between 2015 and 2016 (almost 30%), the jump between 2016 and 2017 is not as much. Please note that here we exclude the cancellations and consider the number of bookings per month to account for the availability of partial data across years. This might not the right approximation as we do not account for seasonality here, for instance maybe hotel bookings might spike during the holiday season and the 2017 data is only till the month of August. This is something we will explore while carrying out the detailed EDAs.

bookings_by_year <- 
  hotels %>%
    filter(is_canceled == 0) %>%
      group_by(arrival_date_year) %>%
        summarise(bookings_per_month = n()/n_distinct(arrival_date_month))

ggplot(bookings_by_year,aes(x=arrival_date_year,y=bookings_per_month)) +
  geom_bar(position = "dodge", stat = "identity", fill = "#88ddbb") + 
    geom_text(aes(label=paste(round(bookings_per_month,1)))) + 
      xlab("Arrival Year") +
        ylab("Average Bookings per Month") +
          ggtitle("Bookings per month across years")  

4.2 Hotel Services

4.2.1 Hypothesis Proposed

From the data provided we see there are few additional services that the customer can opt during booking their stay in the hotels. These services are - meal, car parking and other special request. From the hotel services that are present in the data few hypothesis we propose are:

  • Customer opting for 1 or more services are less likely to cancel their reservations
  • Bookings with higher group members and families are more likely to opt for services
  • Families with kids are more likely to opt for meal service during their stay
  • Customers of Resort Hotel are more likely to opt for special service requests than City Hotel Customers
  • During the holiday seasons number of special requests increases for both resort and city hotels

To check the different hypothesis regarding hotel services we will check the booking and cancellation count for Resort Hotel and City Hotel against the different services provided. We will create an additional field called services_offered in our data set which will indicate if a hotel is providing any additional services to the customer.

hotels$services_offered <- ifelse((hotels$meal != "Undefined" | hotels$required_car_parking_spaces > 0 | hotels$total_of_special_requests > 0), "Yes", "No")

On initial exploration we observe that both Resort Hotel and City Hotel offer 1 or more services and higher number customers tend to opt for these services. We can see the distribution as below

hotel_services <- hotels %>% mutate(canceled = ifelse(is_canceled == 1, "yes", "no")) %>%  group_by(hotel, services_offered, canceled) %>% summarise(bookings = n())

ggplot(data = hotel_services, aes(fill = canceled, x = services_offered, y = bookings)) +
  geom_bar(position="dodge", stat="identity") +
  ggtitle("Bookings and Cancellation by Hotel type and service offered") +
  facet_wrap(~hotel) + 
  geom_text(aes(label=bookings), position=position_dodge(width=0.9), vjust=-0.25)

We can further check which hotels provide which services

service_table <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", "no", "yes"),
                                   car_parking = ifelse(required_car_parking_spaces > 0, "yes", "no"),
                                   special_request = ifelse(total_of_special_requests > 0, "yes", "no")) %>% 
                select(hotel, meal_service, car_parking, special_request) %>%
                filter(meal_service == "yes", car_parking == "yes", special_request == "yes") %>% 
                distinct(hotel, meal_service, car_parking, special_request, .keep_all = TRUE)
kable(service_table, caption = "Service offering by hotel type")
Service offering by hotel type
hotel meal_service car_parking special_request
Resort Hotel yes yes yes
City Hotel yes yes yes

From the above table we see both Resort Hotel and City Hotel provide all 3 types of additional services. Now that we have checked our data we check our stated hypothesis to derive actionable strategies.

4.3 Customer Demography features

4.3.1 Hypothesis Proposed

To ensure anonymity in data the PMS system has removed any customer identifiers. We have few demographic data in our hotels data table - number of members, country of origin, type of deposit paid, etc. Based on these characteristics we propose the following key question which will help us identify factors influencing bookings and cancellations:

  • The customers who are part of a group or group with children tend to have a longer stay
  • Groups or groups with children tend to cancel the bookings more as the count increases
  • Booking would peak around July-August(mid - year) for visitors from different countries

One of the key inferences that we get from the data is that majority of tourist who book the hotel stay are from Portugal which is the home country followed by tourists from Great Britain which can be seen below

hotels_country <- hotels %>% select(hotel, country, arrival_date_year, is_canceled) %>% filter(is_canceled == 0) %>% 
                  group_by(country) %>% summarise(bookings = n()) %>% arrange(desc(bookings))

wordcloud(words = hotels_country$country, freq = hotels_country$bookings, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35,            colors=brewer.pal(8, "Dark2"))

To check the hypothesis related to family we need to come up with a logic which would qualify an entry as family. For that, an entry with a non-negative count of either babies or children and a non-negative count of an adult would qualify as a family and is named is_family.A new variable named booking_duration is created that will indicate the number of days a customer stayed. We’ll sum stays_in_weekend_nights and stays_in_week_nights to compute this.

is_family <- ((hotels$babies> 0  | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration

Making a new variable that includes countries that had the most visitors. This has top 6 countries named and rest of the countries as others.

hotels$new_country <- hotels$country
hotels$new_country[hotels$new_country != "PRT" & hotels$new_country != "GBR" 
                   & hotels$new_country != "FRA" & hotels$new_country != "ESP"
                   & hotels$new_country != "DEU" & hotels$new_country != "ITA"] <- "Others"
hotels$new_country[hotels$new_country == "PRT"] <- "Portugal"
hotels$new_country[hotels$new_country == "GBR"] <- "Great Britain"
hotels$new_country[hotels$new_country == "FRA"] <- "France"
hotels$new_country[hotels$new_country == "ESP"] <- "Spain"
hotels$new_country[hotels$new_country == "DEU"] <- "Germany"
hotels$new_country[hotels$new_country == "ITA"] <- "Italy"

4.4 Booking Profile Features

4.4.1 Hypothesis Proposed

The booking profile of the customer in consonance with the other feature types can be used to derive holistic insights from the data. Following hypothesis will be tested:

  • There is a relation between the distribution channel and ability of the hotel to provide the requested room
  • Hotels in the city have greater traffic and are more susceptible to overbooking
  • There is a significant relation between the room types and the adr
  • There is a significant relation between the market segment and the assigned room types
  • There is a significant relation between the distribution channel and the reserved room types
  • Certain room types can have a larger lead time
  • Majority of the people tend to not pay deposits while booking
  • Groups with children tend to pay deposits while booking
  • People who do not pay deposit while booking tend to cancel the booking

We create a new variable that tracks the ability of the hotel to fulfil the booking by assigning the reserved room to its customer.

hotels$same_room <- (hotels$reserved_room_type == hotels$assigned_room_type)

We create new variables to track groups with children and booking duration of all customers.

is_family <- ((hotels$babies> 0  | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration
is_deposit = ((hotels$deposit_type == "Non Refund" |     hotels$deposit_type == "Refundable"))
hotels$is_deposit <- is_deposit
hotels$is_deposit[hotels$is_deposit == TRUE] <- "Deposited"
hotels$is_deposit[hotels$is_deposit == FALSE] <- "No Deposit"

4.5 Market Channel Features

4.5.1 Hypothesis Proposed

The market channel metrics conveys the booking channel that was used by the customers for making the reservation. Our market channel metrics contain - Agent, Company, Distribution Channel and Market Segment. Based on our initial exploration we put forward the following hypothesis to gauge the impact of the booking channel on customer bookings:

  • The bookings made through travel agencies tend to not get cancelled
  • People with more family members tend to make bookings with travel agencies

On a overall level we see that customers prefer to make reservations through online travel agencies as we can see in the below plots for both city Hotel and Resort Hotel.

hotel_mkseg <- hotels %>% select(hotel, arrival_date_year, market_segment, is_canceled) %>% filter(is_canceled == 0, ) %>% 
              group_by(hotel, arrival_date_year, market_segment) %>% summarise(bookings = n()) 

ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "City Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
  geom_bar(stat = "identity", fill = "#00BFC4") +
  coord_flip() +
  scale_y_continuous(name="Total Bookings") +
  scale_x_discrete(name="Booking Channel") +
  ggtitle("City Hotel Bookings through year trend") +
  facet_wrap(~arrival_date_year)

ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "Resort Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
  geom_bar(stat = "identity", fill = "#00BFC4") +
  coord_flip() +
  scale_y_continuous(name="Total Bookings") +
  scale_x_discrete(name="Booking Channel") +
  ggtitle("Resort Hotel Bookings through year trend") +
  facet_wrap(~arrival_date_year)

Over the years the online travel agencies have gained more popularity which is reflected in our data as the number of bookings increase on this channel.

5. Insights and Recommendations

Below are a few recommendations based on our analysis:

  1. Early Bird Offers: We observe that a lot of cancellations occur in the last few lead time buckets. To reduce these we can introduce early bird discounts as pricing is most likely the reason that these bookings get dropped
  2. Double Book Anticipating Cancellations: When the party size is greater than 12 we notice that the cancellation rate is 100%, so it maybe valuable to double book the rooms allocated in such instances so as to not lose out on bookings
  3. Testing Out New Services: Special service requests are high in the summer months, this might be the best time to test out planned service launches as customers tend to opt for them during these months
  4. Offering Bump-ups on Meal Services: To improve traction for the less preferred meal services, discounts can be offered on service bump-ups, i.e. SC/Undefined to BB, BB to HB and HB to FB
  5. Incentive to Improve Bookings with Deposit Count: Deposit based bookings can be promoted by reducing prices on them, this is often employed by travel booking websites

6. Conclusion

The objective of the exploration was to understand the key drivers behind Resort and City hotel bookings in Portugal. From the detailed analysis we were able to identify few key drivers like - lead time, group size, service types, etc. Based on these influential factors, strategic and marketing recommendations were devised. These factor identification can be further leveraged to understand sales fluctuations, booking prediction and campaign design for hotels from this regions. Few of the broader insights can be leveraged across the hotel industry.