Members : Soumi Giri, Arvind Kurumbal Radhakrishnan, Samreen Pathan, Saksham Bhutani, Divyanshu Kumar

1. Introduction

1.1 Problem Statement

Tourism has always being a very lucrative industry worth millions of dollars in US alone. Due to the pandemic in 2020 the tourism industry was one of the worst impacted sectors. But as the world is moving towards a new normal we are taking a look at what strategies can be devised to improve booking rates for hotels as well as improve the hotel revenues.

To infer the different marketing and tactical strategies to improve hotel bookings we are taking a look at the hotels data provided to us. The data gives us details of hotel bookings from 1st of July of 2015 to the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. We have 2 hotel types from the same chain in our data Resort Hotel and City Hotel for which we have details regarding bookings like arrival dates, family members, type of room, booking medium etc. Both the hotels are located in Portugal, The Resort Hotel in resort region in Algarve and the City Hotel in Lisbon. Using the data we are trying to identify actionable strategies which would help us increase the booking rates of the hotels while also trying to identify the customer preferences.

1.2 Approach

We have a host of information at our disposal including arrival date information, duration of stay, customer demographics (number of adults/children), booking channel information (travel agents/tour operators details), customer historic behavior (booking changes/cancellations) etc. These features will be analyzed against number of bookings/Average Daily Rate/number of cancellations. This data exploration exercise should help:

  • A person with little context of the field gain insights on trends in the hotel industry
  • Serve as good starting point for someone looking to enter the industry
  • Help existing players modify their market strategies

1.3 Analytical Steps

Following were the steps taken:

  1. Identify features: Analyze all features and identify key features on the basis of the insights to be generated.
  2. Clean data: Diagnose and treat any problems in the data on a per feature basis.
  3. Exploratory Data Analysis:
    • Summarize: Understand distribution of variables, use summary statistics like mean, median etc.
    • Visualize: Use visualizations to better understand the structure and distributions across variables.
  4. Generate hypothesis: On the basis of EDA formulate hypothesis using key features
  5. Test hypothesis: Test generated hypothesis to check validity of claims
  6. Conclusion: Infer result of hypothesis test and generate insights in accordance with the results of the testing

1.4 Benefits of the Analysis

This analysis will give the management team of the Portugal based hotel chain, a good understanding of bookings and cancellation trends over time and across other variables like customer demographics and services provided by the hotel. The insights can also be considered as a generalized trend for the industry across similar geographical regions. In addition to gaining an understanding of prevalent trends, this analysis can also be the foundation for design of marketing and operational strategies.

2. Packages Required

2.1 Packages Used in Analysis

The following packages are loaded to conduct our analysis

library(ggplot2)
library(dplyr)
library(knitr)
library(DT)

2.2 Package Description

The following packages are used for our analysis:

  1. dplyr - dplyr is a grammar of data manipulation, providing a consistent set of verbs that helps us solve the most common data manipulation challenges.
  2. ggplot2 - ggplot2 is a system for declaratively creating graphics, to help us visualize our data. We provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  3. knitr - This is an alternative tool to Sweave with a more flexible design and new features like caching and finer control of graphics. This help us incorporate more dynamic graphic representation (example: table views) for our report.
  4. DT - R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables.

3. Data Preparation

3.1 Source of Data

Data used for our analysis can be found here.

3.2 Data Description

The data format is mixed (raw and processed).The data is comprised of two types of hotels i.e. Resort Hotel (40,060 obs) located in Algarve and City Hotel (79,330 obs) located in Lisbon. The data was collected between July 1st, 2015 and August 31st, 2017. Since the data is real, hotel name and any customer identifiers are not included.The data has 32 variables which can be classified into the following categories:

  1. Time based features - ArrivalDateDayOfMonth, ArrivalDateMonth, ArrivalDateWeekNumber, ArrivalDateYear, LeadTime, ReservationStatus, ReservationStatusDate

  2. Customer Profile - Adults, Babies, Children, Country, CustomerType, RequiredCardParkingSpaces, PreviousCancellations, PreviousBookingsNotCanceled, TotalOfSpecialRequests

  3. Booking Profile - AssignedRoomType, BookingChanges, DaysInWaitingList, IsCanceled, IsRepeatedGuest, LeadTime, ReservedRoomType, StaysInWeekendNights, StaysInWeekNights

  4. Market Channels - Agent, Company, DistributionChannel, MarketSegment

  5. Financials - ADR, DepositType

The data was collected for educational purposes and is focusing on revenue management. Data was obtained directly from the hotels’ PMS databases’ servers.The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.

The objective of the data is to give a better understanding of this market as multiple things can be explored by this data like customer segmentation, seasonality, cancellation prediction, etc.

A detailed data dictionary view - Link.

3.3 Data Importing and Cleaning

hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')

3.3.1 Understanding Data Structure

In our hotels data, we have 13 character variables, 18 numeric variables and 1 date variable.

str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(hotels)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

3.3.2 Understanding Data Distribution

Type of room reserved vs assigned

table(hotels$reserved_room_type, hotels$assigned_room_type)
##    
##         A     B     C     D     E     F     G     H     I     K     L     P
##   A 73598  1123  1447  7548  1156   417   186    94   215   210     0     0
##   B   111   988     0     5     2     2     8     0     0     2     0     0
##   C     5     2   883     6     4     2    10    10    10     0     0     0
##   D   312    27    34 17736   686   204    82     9    67    44     0     0
##   E    15     3     8    22  5923   404   100     4    40    16     0     0
##   F     6    17     0     4    31  2707   116     3    10     3     0     0
##   G     5     2     2     0     4    14  2041     7    15     4     0     0
##   H     0     0     0     1     0     0    10   584     6     0     0     0
##   L     1     1     1     0     0     1     0     1     0     0     1     0
##   P     0     0     0     0     0     0     0     0     0     0     0    12

Types of rooms which repeated guests prefer

table(hotels$is_repeated_guest, hotels$assigned_room_type)
##    
##         A     B     C     D     E     F     G     H     I     K     L     P
##   0 71910  2112  2288 24411  7539  3631  2465   693   304   214     1    12
##   1  2143    51    87   911   267   120    88    19    59    65     0     0

Are guests more likely to return if room booked by the customer is assigned

table(hotels$is_repeated_guest,hotels$reserved_room_type == hotels$assigned_room_type)
##    
##      FALSE   TRUE
##   0  13857 101723
##   1   1060   2750

Cancellations when correct room booked by the customer is assigned

table(hotels$is_canceled,hotels$reserved_room_type == hotels$assigned_room_type)
##    
##     FALSE  TRUE
##   0 14115 61051
##   1   802 43422

Booking changes when room booked by the customer is assigned

table(hotels$booking_changes, hotels$reserved_room_type == hotels$assigned_room_type)
##     
##      FALSE  TRUE
##   0  11028 90286
##   1   2886  9815
##   2    642  3163
##   3    195   732
##   4     80   296
##   5     36    82
##   6     16    47
##   7      9    22
##   8      6    11
##   9      3     5
##   10     2     4
##   11     1     1
##   12     1     1
##   13     3     2
##   14     3     2
##   15     2     1
##   16     0     2
##   17     2     0
##   18     0     1
##   20     1     0
##   21     1     0

Correct room assignments by hotel

table(hotels$hotel,hotels$reserved_room_type == hotels$assigned_room_type)
##               
##                FALSE  TRUE
##   City Hotel    7192 72138
##   Resort Hotel  7725 32335

Bookings previously cancelled when room booked was room assigned

table(hotels$previous_cancellations > 0,hotels$reserved_room_type == hotels$assigned_room_type)
##        
##         FALSE  TRUE
##   FALSE 14724 98182
##   TRUE    193  6291
table(hotels$previous_cancellations>1,hotels$is_canceled)
##        
##             0     1
##   FALSE 74961 43996
##   TRUE    205   228

Booking distributions by customer type, deposit type and distribution channel used for booking across both the hotels

table(hotels$hotel,hotels$customer_type)
##               
##                Contract Group Transient Transient-Party
##   City Hotel       2300   293     59404           17333
##   Resort Hotel     1776   284     30209            7791
table(hotels$hotel,hotels$deposit_type)
##               
##                No Deposit Non Refund Refundable
##   City Hotel        66442      12868         20
##   Resort Hotel      38199       1719        142
table(hotels$hotel,hotels$distribution_channel)
##               
##                Corporate Direct   GDS TA/TO Undefined
##   City Hotel        3408   6780   193 68945         4
##   Resort Hotel      3269   7865     0 28925         1

Majority booking in both hotels were made under transient customer type, with no deposit while booking, and through travel agents/tour operators.

Distribution of arrival_date_year

unique(hotels$arrival_date_year)
## [1] 2015 2016 2017
# Checking if all months are present in all years

hotels %>% 
  group_by(arrival_date_year) %>% 
  summarise(Count = n_distinct(arrival_date_month))
## # A tibble: 3 x 2
##   arrival_date_year Count
##               <dbl> <int>
## 1              2015     6
## 2              2016    12
## 3              2017     8
# Checking distinct dates in each month of each year

hotels %>% 
  group_by(arrival_date_year,arrival_date_month) %>% 
  summarise(Count = n_distinct(arrival_date_day_of_month))
## `summarise()` has grouped output by 'arrival_date_year'. You can override using the `.groups` argument.
## # A tibble: 26 x 3
## # Groups:   arrival_date_year [3]
##    arrival_date_year arrival_date_month Count
##                <dbl> <chr>              <int>
##  1              2015 August                31
##  2              2015 December              31
##  3              2015 July                  31
##  4              2015 November              30
##  5              2015 October               31
##  6              2015 September             30
##  7              2016 April                 30
##  8              2016 August                31
##  9              2016 December              31
## 10              2016 February              29
## # ... with 16 more rows

3.3.3 NULL/NA Identification and Treatment

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

As mentioned earlier for the agent and company columns, the string “NULL” indicates that the booking wasn’t made through an agent/company. Below are the summary numbers for such bookings.

length(hotels$agent[hotels$agent == "NULL"])
## [1] 16340
length(hotels$company[hotels$company == "NULL"])
## [1] 112593

Agent column has 16340 (~13%) NULLs and Company has 112593 (~94%) NULLs. This indicates that these bookings were not made via any agent or company.

Rest of the columns do not have any NULLs or missing values.

3.3.4 Checking for Outliers

We check for outliers for the following numeric variables

ADR

We observe that for ADR, all values lie within 550 except one which is way greater, 5400. We are going with an assumption that this is a data entry issue and treating it as an outlier, replacing it by 0.

boxplot(hotels$adr)

hotels$adr[hotels$adr=="5400"] <- 0
hist(hotels$adr, breaks = 50)

Days in waiting list

We observe that 96% bookings were confirmed within a day.

hist(hotels$days_in_waiting_list)

Booking Changes

We observe that for Booking Changes, 95% of the values lie between 0 and 2 and 99% between 0 and 4. There are only 267 such cases where there are more than 4 booking changes.

#Understand variable and visually inspect
table(hotels$booking_changes)
## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 101314  12701   3805    927    376    118     63     31     17      8      6 
##     11     12     13     14     15     16     17     18     20     21 
##      2      2      5      5      3      2      2      1      1      1
#identify outliers
quantile(hotels$booking_changes,probs = c(0.005,0.995))
##  0.5% 99.5% 
##     0     4
table(hotels$booking_changes > 4)
## 
##  FALSE   TRUE 
## 119123    267
#Prepped column
hist(hotels$booking_changes)

Previous cancellations

We observe that for Previous cancellations, 99% of the values lie between 0 and 1. There are only 317 such cases where there are more than 2 previous cancellations.

#Understand variable and visually inspect
table(hotels$previous_cancellations)
## 
##      0      1      2      3      4      5      6     11     13     14     19 
## 112906   6051    116     65     31     19     22     35     12     14     19 
##     21     24     25     26 
##      1     48     25     26
#identify outliers
quantile(hotels$previous_cancellations,probs = c(0.005,0.995))
##  0.5% 99.5% 
##     0     1
table(hotels$previous_cancellations > 2)
## 
##  FALSE   TRUE 
## 119073    317
#Prepped column
hist(hotels$previous_cancellations)

Previous bookings not canceled

We observe that for Previous bookings not canceled, 99% of the values lie between 0 and 7. There are only 552 such cases where there are more than 7 non-cancellations.

#Understand variable and visually inspect
table(hotels$previous_bookings_not_canceled)
## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 115770   1542    580    333    229    181    115     88     70     60     53 
##     11     12     13     14     15     16     17     18     19     20     21 
##     43     37     30     28     21     20     16     14     13     12     12 
##     22     23     24     25     26     27     28     29     30     31     32 
##     10      7      9     17      7      9      7      6      4      2      2 
##     33     34     35     36     37     38     39     40     41     42     43 
##      1      1      1      1      1      1      1      1      1      1      1 
##     44     45     46     47     48     49     50     51     52     53     54 
##      2      1      1      1      2      1      1      1      1      1      1 
##     55     56     57     58     59     60     61     62     63     64     65 
##      1      1      1      2      1      1      1      1      1      1      1 
##     66     67     68     69     70     71     72 
##      1      1      1      1      1      1      1
#identify outliers
quantile(hotels$previous_bookings_not_canceled,probs = c(0.005,0.995))
##  0.5% 99.5% 
##     0     7
table(hotels$previous_bookings_not_canceled > 7)
## 
##  FALSE   TRUE 
## 118838    552
#Prepped column
hist(hotels$previous_bookings_not_canceled)

Lead Time

We observe that 3148 bookings in our hotels data has Lead Time of more than a year. From the below data we also see that Resort Hotel has lower average lead time than City Hotel.

#Understand variable and visually inspect
hist(hotels$lead_time)

#identify outliers
table(hotels$lead_time > 365)
## 
##  FALSE   TRUE 
## 116242   3148
#Checking lead time by hotel type
hotels %>% 
  group_by(hotel) %>% 
  summarise(average_lead_time = mean(lead_time),
            min_lead_time = min(lead_time),
            max_lead_time = max(lead_time)) 
## # A tibble: 2 x 4
##   hotel        average_lead_time min_lead_time max_lead_time
##   <chr>                    <dbl>         <dbl>         <dbl>
## 1 City Hotel               110.              0           629
## 2 Resort Hotel              92.7             0           737

3.4 Final Dataset

datatable(hotels[1:100,], caption = 'Table 1: Clean and tidy data.')

3.5 Data Summary

summary(hotels)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr        
##  Length:119390      Min.   :  0.000      Length:119390      Min.   : -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.: 69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median : 94.56  
##                     Mean   :  2.321                         Mean   :101.79  
##                     3rd Qu.:  0.000                         3rd Qu.:126.00  
##                     Max.   :391.000                         Max.   :510.00  
##                                                                             
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

4. Exploratory Data Analysis

From the 32 features present in our data we can broadly classify the features in the following categories:

  • Time based features - Gives us when a booking was made, when the booking party arrived and how long the stay was
  • Customer Demography features - Characteristics of the customers who had made the booking. These consists of customer country, family size, etc.
  • Booking Profile features - These columns provide us a specific characteristics of a customer booking entry. Example - booking duration. stay in weekends, stay in week night, etc.
  • Market Channel features - Different channels through which the reservation was made. Example - online, corporate, etc.
  • Hotel Services - Different additional services that the hotel provides. Example - meals, car parking, etc.

We will identify insights and strategies on these 4 categories to understand the factors that influence a hotel booking and cancellation.

4.1 Time Based Features

4.1.1 Hypothesis Proposed

We have information regarding the lead time between booking and arrival, arrival date, month, year, duration of stay in the data. Leveraging these, the below hypothesis are proposed:

  • Relation between lead time and cancellation, i.e. are bookings that were made a long time back canceled more often or are recently made bookings canceled more often ?
  • Which months/weeks have more bookings/ADR/cancellations ?
  • Which days of each month have more bookings ? Are there certain days in certain months where bookings spike, for example the Dec 25 - Jan 1st week? This combined with the previous analysis can also be used for dynamic pricing.
  • Do people stay at resort hotels more during weekends or city hotels ? Vacation vs Business sort of trend ? Analysis of weekday nights and weekend nights for resort hotel vs city hotel.

On initial exploration of number of bookings across years, we see a big jump between 2015 and 2016 (almost 30%), the jump between 2016 and 2017 is not as much. Please note that here we exclude the cancellations and consider the number of bookings per month to account for the availability of partial data across years. This might not the right approximation as we do not account for seasonality here, for instance maybe hotel bookings might spike during the holiday season and the 2017 data is only till the month of August. This is something we will explore while carrying out the detailed EDAs.

bookings_by_year <- 
  hotels %>%
    filter(is_canceled == 0) %>%
      group_by(arrival_date_year) %>%
        summarise(bookings_per_month = n()/n_distinct(arrival_date_month))

ggplot(bookings_by_year,aes(x=arrival_date_year,y=bookings_per_month)) +
  geom_bar(position = "dodge", stat = "identity", fill = "#00BFC4") + 
    geom_text(aes(label=paste(round(bookings_per_month,1))), position=position_dodge(width=0.9), vjust=-0.25) + 
    ggtitle("Bookings trends over the years")  

4.2 Hotel Services

4.2.1 Hypothesis Proposed

From the data provided we see there are few additional services that the customer can opt during booking their stay in the hotels. These services are - meal, car parking and other special request. From the hotel services that are present in the data few hypothesis we propose are:

  • Customer opting for 1 or more services are less likely to cancel their reservations
  • Bookings with higher group members and families are more likely to opt for services
  • Families with kids are more likely to opt for meal service during their stay
  • Customers of Resort Hotel are more likely to opt for special service requests than City Hotel Customers
  • During the holiday seasons number of special requests increases for both resort and city hotels

To check the different hypothesis regarding hotel services we will check the booking and cancellation count for Resort Hotel and City Hotel against the different services provided. We will create an additional field called services_offered in our data set which will indicate if a hotel is providing any additional services to the customer.

hotels$services_offered <- ifelse((hotels$meal != "Undefined" | hotels$required_car_parking_spaces > 0 | hotels$total_of_special_requests > 0), "Yes", "No")

On initial exploration we observe that both Resort Hotel and City Hotel offer 1 or more services and higher number customers tend to opt for these services. We can see the distribution as below

hotel_services <- hotels %>% mutate(canceled = ifelse(is_canceled == 1, "yes", "no")) %>%  group_by(hotel, services_offered, canceled) %>% summarise(bookings = n())

ggplot(data = hotel_services, aes(fill = canceled, x = services_offered, y = bookings)) +
  geom_bar(position="dodge", stat="identity") +
  ggtitle("Bookings and Cancellation by Hotel type and service offered") +
  facet_wrap(~hotel) + 
  geom_text(aes(label=bookings), position=position_dodge(width=0.9), vjust=-0.25)

We can further check which hotels provide which services

service_table <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", "no", "yes"),
                                   car_parking = ifelse(required_car_parking_spaces > 0, "yes", "no"),
                                   special_request = ifelse(total_of_special_requests > 0, "yes", "no")) %>% 
                select(hotel, meal_service, car_parking, special_request) %>%
                filter(meal_service == "yes", car_parking == "yes", special_request == "yes") %>% 
                distinct(hotel, meal_service, car_parking, special_request, .keep_all = TRUE)
kable(service_table, caption = "Service offering by hotel type")
Service offering by hotel type
hotel meal_service car_parking special_request
Resort Hotel yes yes yes
City Hotel yes yes yes

From the above table we see both Resort Hotel and City Hotel provide all 3 types of additional services. Now that we have checked our data we check our stated hypothesis to derive actionable strategies.

4.3 Customer Demography features

4.3.1 Hypothesis Proposed

To ensure anonymity in data the PMS system has removed any customer identifiers. We have few demographic data in our hotels data table - number of members, country of origin, type of deposit paid, etc. Based on these characteristics we propose the following key question which will help us identify factors influencing bookings and cancellations:

  • Majority of the people tend to not pay deposits while booking
  • People with more family members tend to pay deposits while booking
  • People who do not pay deposit while booking tend to cancel the booking
  • The least amount of cancellations occur in the booking types which are transient
  • The customers who are part of a family tend to have a longer stay.
  • The customers who are not part of a family trip tend to visit more on weekdays.
  • The customers who are part of a family are more likely to book FB(Full-board) meals.
  • Customers who do not belong to Portugal are likely to stay in weekends.
  • Resort hotels is likely to have more booking for weekend for customers from Portugal.

One of the key inferences that we get from the data is that majority of tourist who book the hotel stay are from Portugal which is the home country followed by tourists from Great Britain which can be seen below

hotels_country <- hotels %>% select(hotel, country, arrival_date_year, is_canceled) %>% filter(is_canceled == 0) %>% 
                  group_by(country) %>% summarise(bookings = n()) %>% arrange(desc(bookings))

ggplot(data = hotels_country[1:10,], aes(x = reorder(country, bookings), y = bookings)) +
  geom_bar(stat = "identity", fill = "#00BFC4") +
  coord_flip() +
  scale_y_continuous(name="Total Bookings") +
  scale_x_discrete(name="Country of origin") +
  ggtitle("Top 10 countries of origin")

To check the hypothesis related to family we need to come up with a logic which would qualify an entry as family. For that, an entry with a non-negative count of either babies or children and a non-negative count of an adult would qualify as a family and is named is_family.A new variable named booking_duration is created that will indicate the number of days a customer stayed. We’ll sum stays_in_weekend_nights and stays_in_week_nights to compute this.

is_family <- ((hotels$babies> 0  | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration

4.4 Booking Profile Features

4.4.1 Hypothesis Proposed

The booking profile of the customer in consonance with the other feature types can be used to derive holistic insights from the data. Following hypothesis will be tested:

  • There is a relation between the distribution channel and ability of the hotel to provide the requested room
  • Hotels in the city have greater traffic and are more susceptible to overbooking
  • There is a significant relation between the room types and the adr
  • There is a significant relation between the market segment and the assigned room types
  • There is a significant relation between the distribution channel and the reserved room types
  • Certain room types can have a larger lead time
  • Larger groups have greater booking changes
  • The number of booking changes is significantly related to the number of previous cancellations
  • Larger lead time can lead to larger number of booking changes

We create a new variable that tracks the ability of the hotel to fulfil the booking by assiging the reserved room to its customer.

hotels$same_room <- (hotels$reserved_room_type == hotels$assigned_room_type)

4.5 Market Channel Features

4.5.1 Hypothesis Proposed

The market channel metrics conveys the booking channel that was used by the customers for making the reservation. Our market channel metrics contain - Agent, Company, Distribution Channel and Market Segment. Based on our initial exploration we put forward the following hypothesis to gauge the impact of the booking channel on customer bookings:

  • The bookings made through travel agencies tend to not get cancelled
  • People with more family members tend to make bookings with travel agencies
  • The bookings made through companies tend to get cancelled more

On a overall level we see that customers prefer to make reservations through online travel agencies as we can see in the below plots for both city Hotel and Resort Hotel.

hotel_mkseg <- hotels %>% select(hotel, arrival_date_year, market_segment, is_canceled) %>% filter(is_canceled == 0, ) %>% 
              group_by(hotel, arrival_date_year, market_segment) %>% summarise(bookings = n()) 

ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "City Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
  geom_bar(stat = "identity", fill = "#00BFC4") +
  coord_flip() +
  scale_y_continuous(name="Total Bookings") +
  scale_x_discrete(name="Booking Channel") +
  ggtitle("City Hotel Bookings through year trend") +
  facet_wrap(~arrival_date_year)

ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "Resort Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
  geom_bar(stat = "identity", fill = "#00BFC4") +
  coord_flip() +
  scale_y_continuous(name="Total Bookings") +
  scale_x_discrete(name="Booking Channel") +
  ggtitle("Resort Hotel Bookings through year trend") +
  facet_wrap(~arrival_date_year)

Over the years the online travel agencies have gained more popularity which is reflected in our data as the number of bookings increase on this channel.