| Members : Soumi Giri, Arvind Kurumbal Radhakrishnan, Samreen Pathan, Saksham Bhutani, Divyanshu Kumar |
Tourism has always being a very lucrative industry worth millions of dollars in US alone. Due to the pandemic in 2020 the tourism industry was one of the worst impacted sectors. But as the world is moving towards a new normal we are taking a look at what strategies can be devised to improve booking rates for hotels as well as improve the hotel revenues.
To infer the different marketing and tactical strategies to improve hotel bookings we are taking a look at the hotels data provided to us. The data gives us details of hotel bookings from 1st of July of 2015 to the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. We have 2 hotel types from the same chain in our data Resort Hotel and City Hotel for which we have details regarding bookings like arrival dates, family members, type of room, booking medium etc. Both the hotels are located in Portugal, The Resort Hotel in resort region in Algarve and the City Hotel in Lisbon. Using the data we are trying to identify actionable strategies which would help us increase the booking rates of the hotels while also trying to identify the customer preferences.
We have a host of information at our disposal including arrival date information, duration of stay, customer demographics (number of adults/children), booking channel information (travel agents/tour operators details), customer historic behavior (booking changes/cancellations) etc. These features will be analyzed against number of bookings/Average Daily Rate/number of cancellations. This data exploration exercise should help:
Following were the steps taken:
This analysis will give the management team of the Portugal based hotel chain, a good understanding of bookings and cancellation trends over time and across other variables like customer demographics and services provided by the hotel. The insights can also be considered as a generalized trend for the industry across similar geographical regions. In addition to gaining an understanding of prevalent trends, this analysis can also be the foundation for design of marketing and operational strategies.
The following packages are loaded to conduct our analysis
library(ggplot2)
library(dplyr)
library(knitr)
library(DT)
The following packages are used for our analysis:
Data used for our analysis can be found here.
The data format is mixed (raw and processed).The data is comprised of two types of hotels i.e. Resort Hotel (40,060 obs) located in Algarve and City Hotel (79,330 obs) located in Lisbon. The data was collected between July 1st, 2015 and August 31st, 2017. Since the data is real, hotel name and any customer identifiers are not included.The data has 32 variables which can be classified into the following categories:
Time based features - ArrivalDateDayOfMonth, ArrivalDateMonth, ArrivalDateWeekNumber, ArrivalDateYear, LeadTime, ReservationStatus, ReservationStatusDate
Customer Profile - Adults, Babies, Children, Country, CustomerType, RequiredCardParkingSpaces, PreviousCancellations, PreviousBookingsNotCanceled, TotalOfSpecialRequests
Booking Profile - AssignedRoomType, BookingChanges, DaysInWaitingList, IsCanceled, IsRepeatedGuest, LeadTime, ReservedRoomType, StaysInWeekendNights, StaysInWeekNights
Market Channels - Agent, Company, DistributionChannel, MarketSegment
Financials - ADR, DepositType
The data was collected for educational purposes and is focusing on revenue management. Data was obtained directly from the hotels’ PMS databases’ servers.The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.
The objective of the data is to give a better understanding of this market as multiple things can be explored by this data like customer segmentation, seasonality, cancellation prediction, etc.
A detailed data dictionary view - Link.
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
In our hotels data, we have 13 character variables, 18 numeric variables and 1 date variable.
str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
summary(hotels)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
Type of room reserved vs assigned
table(hotels$reserved_room_type, hotels$assigned_room_type)
##
## A B C D E F G H I K L P
## A 73598 1123 1447 7548 1156 417 186 94 215 210 0 0
## B 111 988 0 5 2 2 8 0 0 2 0 0
## C 5 2 883 6 4 2 10 10 10 0 0 0
## D 312 27 34 17736 686 204 82 9 67 44 0 0
## E 15 3 8 22 5923 404 100 4 40 16 0 0
## F 6 17 0 4 31 2707 116 3 10 3 0 0
## G 5 2 2 0 4 14 2041 7 15 4 0 0
## H 0 0 0 1 0 0 10 584 6 0 0 0
## L 1 1 1 0 0 1 0 1 0 0 1 0
## P 0 0 0 0 0 0 0 0 0 0 0 12
Types of rooms which repeated guests prefer
table(hotels$is_repeated_guest, hotels$assigned_room_type)
##
## A B C D E F G H I K L P
## 0 71910 2112 2288 24411 7539 3631 2465 693 304 214 1 12
## 1 2143 51 87 911 267 120 88 19 59 65 0 0
Are guests more likely to return if room booked by the customer is assigned
table(hotels$is_repeated_guest,hotels$reserved_room_type == hotels$assigned_room_type)
##
## FALSE TRUE
## 0 13857 101723
## 1 1060 2750
Cancellations when correct room booked by the customer is assigned
table(hotels$is_canceled,hotels$reserved_room_type == hotels$assigned_room_type)
##
## FALSE TRUE
## 0 14115 61051
## 1 802 43422
Booking changes when room booked by the customer is assigned
table(hotels$booking_changes, hotels$reserved_room_type == hotels$assigned_room_type)
##
## FALSE TRUE
## 0 11028 90286
## 1 2886 9815
## 2 642 3163
## 3 195 732
## 4 80 296
## 5 36 82
## 6 16 47
## 7 9 22
## 8 6 11
## 9 3 5
## 10 2 4
## 11 1 1
## 12 1 1
## 13 3 2
## 14 3 2
## 15 2 1
## 16 0 2
## 17 2 0
## 18 0 1
## 20 1 0
## 21 1 0
Correct room assignments by hotel
table(hotels$hotel,hotels$reserved_room_type == hotels$assigned_room_type)
##
## FALSE TRUE
## City Hotel 7192 72138
## Resort Hotel 7725 32335
Bookings previously cancelled when room booked was room assigned
table(hotels$previous_cancellations > 0,hotels$reserved_room_type == hotels$assigned_room_type)
##
## FALSE TRUE
## FALSE 14724 98182
## TRUE 193 6291
table(hotels$previous_cancellations>1,hotels$is_canceled)
##
## 0 1
## FALSE 74961 43996
## TRUE 205 228
Booking distributions by customer type, deposit type and distribution channel used for booking across both the hotels
table(hotels$hotel,hotels$customer_type)
##
## Contract Group Transient Transient-Party
## City Hotel 2300 293 59404 17333
## Resort Hotel 1776 284 30209 7791
table(hotels$hotel,hotels$deposit_type)
##
## No Deposit Non Refund Refundable
## City Hotel 66442 12868 20
## Resort Hotel 38199 1719 142
table(hotels$hotel,hotels$distribution_channel)
##
## Corporate Direct GDS TA/TO Undefined
## City Hotel 3408 6780 193 68945 4
## Resort Hotel 3269 7865 0 28925 1
Majority booking in both hotels were made under transient customer type, with no deposit while booking, and through travel agents/tour operators.
Distribution of arrival_date_year
unique(hotels$arrival_date_year)
## [1] 2015 2016 2017
# Checking if all months are present in all years
hotels %>%
group_by(arrival_date_year) %>%
summarise(Count = n_distinct(arrival_date_month))
## # A tibble: 3 x 2
## arrival_date_year Count
## <dbl> <int>
## 1 2015 6
## 2 2016 12
## 3 2017 8
# Checking distinct dates in each month of each year
hotels %>%
group_by(arrival_date_year,arrival_date_month) %>%
summarise(Count = n_distinct(arrival_date_day_of_month))
## `summarise()` has grouped output by 'arrival_date_year'. You can override using the `.groups` argument.
## # A tibble: 26 x 3
## # Groups: arrival_date_year [3]
## arrival_date_year arrival_date_month Count
## <dbl> <chr> <int>
## 1 2015 August 31
## 2 2015 December 31
## 3 2015 July 31
## 4 2015 November 30
## 5 2015 October 31
## 6 2015 September 30
## 7 2016 April 30
## 8 2016 August 31
## 9 2016 December 31
## 10 2016 February 29
## # ... with 16 more rows
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
As mentioned earlier for the agent and company columns, the string “NULL” indicates that the booking wasn’t made through an agent/company. Below are the summary numbers for such bookings.
length(hotels$agent[hotels$agent == "NULL"])
## [1] 16340
length(hotels$company[hotels$company == "NULL"])
## [1] 112593
Agent column has 16340 (~13%) NULLs and Company has 112593 (~94%) NULLs. This indicates that these bookings were not made via any agent or company.
Rest of the columns do not have any NULLs or missing values.
We check for outliers for the following numeric variables
ADR
We observe that for ADR, all values lie within 550 except one which is way greater, 5400. We are going with an assumption that this is a data entry issue and treating it as an outlier, replacing it by 0.
boxplot(hotels$adr)
hotels$adr[hotels$adr=="5400"] <- 0
hist(hotels$adr, breaks = 50)
Days in waiting list
We observe that 96% bookings were confirmed within a day.
hist(hotels$days_in_waiting_list)
Booking Changes
We observe that for Booking Changes, 95% of the values lie between 0 and 2 and 99% between 0 and 4. There are only 267 such cases where there are more than 4 booking changes.
#Understand variable and visually inspect
table(hotels$booking_changes)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 101314 12701 3805 927 376 118 63 31 17 8 6
## 11 12 13 14 15 16 17 18 20 21
## 2 2 5 5 3 2 2 1 1 1
#identify outliers
quantile(hotels$booking_changes,probs = c(0.005,0.995))
## 0.5% 99.5%
## 0 4
table(hotels$booking_changes > 4)
##
## FALSE TRUE
## 119123 267
#Prepped column
hist(hotels$booking_changes)
Previous cancellations
We observe that for Previous cancellations, 99% of the values lie between 0 and 1. There are only 317 such cases where there are more than 2 previous cancellations.
#Understand variable and visually inspect
table(hotels$previous_cancellations)
##
## 0 1 2 3 4 5 6 11 13 14 19
## 112906 6051 116 65 31 19 22 35 12 14 19
## 21 24 25 26
## 1 48 25 26
#identify outliers
quantile(hotels$previous_cancellations,probs = c(0.005,0.995))
## 0.5% 99.5%
## 0 1
table(hotels$previous_cancellations > 2)
##
## FALSE TRUE
## 119073 317
#Prepped column
hist(hotels$previous_cancellations)
Previous bookings not canceled
We observe that for Previous bookings not canceled, 99% of the values lie between 0 and 7. There are only 552 such cases where there are more than 7 non-cancellations.
#Understand variable and visually inspect
table(hotels$previous_bookings_not_canceled)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 115770 1542 580 333 229 181 115 88 70 60 53
## 11 12 13 14 15 16 17 18 19 20 21
## 43 37 30 28 21 20 16 14 13 12 12
## 22 23 24 25 26 27 28 29 30 31 32
## 10 7 9 17 7 9 7 6 4 2 2
## 33 34 35 36 37 38 39 40 41 42 43
## 1 1 1 1 1 1 1 1 1 1 1
## 44 45 46 47 48 49 50 51 52 53 54
## 2 1 1 1 2 1 1 1 1 1 1
## 55 56 57 58 59 60 61 62 63 64 65
## 1 1 1 2 1 1 1 1 1 1 1
## 66 67 68 69 70 71 72
## 1 1 1 1 1 1 1
#identify outliers
quantile(hotels$previous_bookings_not_canceled,probs = c(0.005,0.995))
## 0.5% 99.5%
## 0 7
table(hotels$previous_bookings_not_canceled > 7)
##
## FALSE TRUE
## 118838 552
#Prepped column
hist(hotels$previous_bookings_not_canceled)
Lead Time
We observe that 3148 bookings in our hotels data has Lead Time of more than a year. From the below data we also see that Resort Hotel has lower average lead time than City Hotel.
#Understand variable and visually inspect
hist(hotels$lead_time)
#identify outliers
table(hotels$lead_time > 365)
##
## FALSE TRUE
## 116242 3148
#Checking lead time by hotel type
hotels %>%
group_by(hotel) %>%
summarise(average_lead_time = mean(lead_time),
min_lead_time = min(lead_time),
max_lead_time = max(lead_time))
## # A tibble: 2 x 4
## hotel average_lead_time min_lead_time max_lead_time
## <chr> <dbl> <dbl> <dbl>
## 1 City Hotel 110. 0 629
## 2 Resort Hotel 92.7 0 737
datatable(hotels[1:100,], caption = 'Table 1: Clean and tidy data.')
summary(hotels)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.56
## Mean : 2.321 Mean :101.79
## 3rd Qu.: 0.000 3rd Qu.:126.00
## Max. :391.000 Max. :510.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
From the 32 features present in our data we can broadly classify the features in the following categories:
We will identify insights and strategies on these 4 categories to understand the factors that influence a hotel booking and cancellation.
We have information regarding the lead time between booking and arrival, arrival date, month, year, duration of stay in the data. Leveraging these, the below hypothesis are proposed:
On initial exploration of number of bookings across years, we see a big jump between 2015 and 2016 (almost 30%), the jump between 2016 and 2017 is not as much. Please note that here we exclude the cancellations and consider the number of bookings per month to account for the availability of partial data across years. This might not the right approximation as we do not account for seasonality here, for instance maybe hotel bookings might spike during the holiday season and the 2017 data is only till the month of August. This is something we will explore while carrying out the detailed EDAs.
bookings_by_year <-
hotels %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year) %>%
summarise(bookings_per_month = n()/n_distinct(arrival_date_month))
ggplot(bookings_by_year,aes(x=arrival_date_year,y=bookings_per_month)) +
geom_bar(position = "dodge", stat = "identity", fill = "#00BFC4") +
geom_text(aes(label=paste(round(bookings_per_month,1))), position=position_dodge(width=0.9), vjust=-0.25) +
ggtitle("Bookings trends over the years")
To test the relationship between lead time and cancellation we use the below plot. We observe that a lot of cancellations are in the first few lead time buckets, which indicates people who have made the booking a while ago tend to follow through and check in.
hotels %>%
filter(is_canceled == 1) %>%
ggplot(aes(lead_time)) +
geom_histogram(col="black",fill="#00BFC4") +
labs(title="Cancelations by Lead Time", x="Lead Time", y="Count")
From the data provided we see there are few additional services that the customer can opt during booking their stay in the hotels. These services are - meal, car parking and other special request. From the hotel services that are present in the data few hypothesis we propose are:
To check the different hypothesis regarding hotel services we will check the booking and cancellation count for Resort Hotel and City Hotel against the different services provided. We will create an additional field called services_offered in our data set which will indicate if a hotel is providing any additional services to the customer.
hotels$services_offered <- ifelse((hotels$meal != "Undefined" | hotels$required_car_parking_spaces > 0 | hotels$total_of_special_requests > 0), "Yes", "No")
On initial exploration we observe that both Resort Hotel and City Hotel offer 1 or more services and higher number customers tend to opt for these services. We can see the distribution as below
hotel_services <- hotels %>% mutate(canceled = ifelse(is_canceled == 1, "yes", "no")) %>% group_by(hotel, services_offered, canceled) %>% summarise(bookings = n())
ggplot(data = hotel_services, aes(fill = canceled, x = services_offered, y = bookings)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Bookings and Cancellation by Hotel type and service offered") +
facet_wrap(~hotel) +
geom_text(aes(label=bookings), position=position_dodge(width=0.9), vjust=-0.25)
We can further check which hotels provide which services
service_table <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", "no", "yes"),
car_parking = ifelse(required_car_parking_spaces > 0, "yes", "no"),
special_request = ifelse(total_of_special_requests > 0, "yes", "no")) %>%
select(hotel, meal_service, car_parking, special_request) %>%
filter(meal_service == "yes", car_parking == "yes", special_request == "yes") %>%
distinct(hotel, meal_service, car_parking, special_request, .keep_all = TRUE)
kable(service_table, caption = "Service offering by hotel type")
| hotel | meal_service | car_parking | special_request |
|---|---|---|---|
| Resort Hotel | yes | yes | yes |
| City Hotel | yes | yes | yes |
From the above table we see both Resort Hotel and City Hotel provide all 3 types of additional services. Now that we have checked our data we check our stated hypothesis to derive actionable strategies.
After the initial exploration of the services data we now check the validity of the five proposed hypothesis
Customer opting for 1 or more services are less likely to cancel their reservations
We check the percent of customers who cancelled after opting for services
service_count <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", 0, 1),
car_parking = ifelse(required_car_parking_spaces > 0, 1, 0),
special_request = ifelse(total_of_special_requests > 0, 1, 0),
total_service = meal_service + car_parking + special_request) %>%
select(hotel, arrival_date_year, total_service, is_canceled) %>% group_by(hotel, arrival_date_year, total_service, is_canceled) %>%
summarise(bookings = n())
service_total <- service_count %>% group_by(hotel, arrival_date_year, total_service) %>% summarise(total = sum(bookings))
service_pctcancelled <- left_join(service_count, service_total, by = c("hotel", "arrival_date_year", "total_service")) %>% filter(is_canceled == 1) %>%
mutate(pct = (bookings/total) * 100) %>% select(hotel, arrival_date_year, total_service, pct)
ggplot(data = service_pctcancelled, aes(fill = as.character(arrival_date_year), x = total_service, y = pct)) +
geom_bar(position = "dodge", stat = "identity") +
ggtitle("% Cancellation by services opted") +
facet_wrap(~hotel) +
xlab("Number of services opted") +
ylab("% bookings cancelled") +
theme(legend.title = element_blank(), text = element_text(size = 10)) +
geom_text(aes(label=paste(round(pct,1), "%")), size = 2.5, position=position_dodge(width=0.9), vjust=-0.3)
From the above plot we can see all City Hotel booking have opted for 1 or more services. For City Hotel we see people who have opted for 2 services are less likely to cancel their reservation as the cancellation rate drops from 56.7% for people opting 1 service to 26.9% for people opting for 2 services in 2017.
For Resort Hotel we see the cancellation rate drops from 39.6% for people opting for 1 service to 28.3% for people opting for 2 services in 2017. In both case of Resort Hotel and City Hotel 0 cancellations were present for people opting for all 3 services. From the trends observed, our hypothesis hold true. We should promote more additional services to customers to make their booking more attractive.
To ensure anonymity in data the PMS system has removed any customer identifiers. We have few demographic data in our hotels data table - number of members, country of origin, type of deposit paid, etc. Based on these characteristics we propose the following key question which will help us identify factors influencing bookings and cancellations:
One of the key inferences that we get from the data is that majority of tourist who book the hotel stay are from Portugal which is the home country followed by tourists from Great Britain which can be seen below
hotels_country <- hotels %>% select(hotel, country, arrival_date_year, is_canceled) %>% filter(is_canceled == 0) %>%
group_by(country) %>% summarise(bookings = n()) %>% arrange(desc(bookings))
ggplot(data = hotels_country[1:10,], aes(x = reorder(country, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Country of origin") +
ggtitle("Top 10 countries of origin")
To check the hypothesis related to family we need to come up with a logic which would qualify an entry as family. For that, an entry with a non-negative count of either babies or children and a non-negative count of an adult would qualify as a family and is named is_family.A new variable named booking_duration is created that will indicate the number of days a customer stayed. We’ll sum stays_in_weekend_nights and stays_in_week_nights to compute this.
is_family <- ((hotels$babies> 0 | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration
People who do not pay deposit while booking tend to cancel the booking
To test the above hypothesis we look at the trend of cancellations across deposit type and check the trend.
hotels$is_canceled <- as.character(hotels$is_canceled)
deposit_canc <- ggplot(data=hotels, aes(x=factor(deposit_type), fill=is_canceled)) +
geom_bar(stat="count", position=position_dodge()) +
labs(title = "Bookings by deposit type", x = "Deposit Type", y = "Bookings")
deposit_canc
This shows us the most number of booking which get cancelled are where people have made no deposit for the booking. But this also gives us another interesting finding that the cancellation rate across booking with non refund is higher as compared to not cancelled bookings.
This helps us deduce that, the rate of overall cancelled bookings (bookings cancelled/booking cancelled+not cancelled) is much higher in case of Non-refund deposit type as compared to the bookings where deposit is not paid.
The customers who are part of a family tend to have a longer stay
The following plot represents the count of customers who belong to a family and stay for more than 2 days or less than 2 days.
hotels %>%
filter(is_family == "TRUE") %>%
ggplot(aes(booking_duration)) +
geom_histogram(col="black", fill="#00BFC4") +
labs(title="Family by Duration", x="Booking Duration", y="Count")
The plot indicates that our hypothesis that families tend to stay longer is correct as the majority of the entries has a value greater than 2. We now check the rest of the hypothesis.
People with more family members tend to pay deposits while booking
We plan to filter the number of bookings with family members > 2 and see if these booking were made with deposit being paid while booking or not. This plot will help us accept or reject the hypothesis.
The least amount of cancellations occur in the booking types which are transient
We will plot the volume of bookings for customer type against is cancelled variable to deduce how many bookings are cancelled/not cancelled under each customer type. This will help us accept or reject the hypothesis.
The booking profile of the customer in consonance with the other feature types can be used to derive holistic insights from the data. Following hypothesis will be tested:
We create a new variable that tracks the ability of the hotel to fulfil the booking by assiging the reserved room to its customer.
hotels$same_room <- (hotels$reserved_room_type == hotels$assigned_room_type)
We can now visually inspect the data to test the validity of our hypothesis.
To test and visualize the relationship between the booking changes and the lead time we use the following plot. Here we can see that there are a greater number of changes in bookings that were booked closer to the arrival date.
hotels %>%
filter(booking_changes > 0) %>%
ggplot(aes(lead_time)) +
geom_histogram(col = "black", fill="#00BFC4") +
labs(title="Booking changes by Lead Time", x="Lead Time", y="Count")
The market channel metrics conveys the booking channel that was used by the customers for making the reservation. Our market channel metrics contain - Agent, Company, Distribution Channel and Market Segment. Based on our initial exploration we put forward the following hypothesis to gauge the impact of the booking channel on customer bookings:
On a overall level we see that customers prefer to make reservations through online travel agencies as we can see in the below plots for both city Hotel and Resort Hotel.
hotel_mkseg <- hotels %>% select(hotel, arrival_date_year, market_segment, is_canceled) %>% filter(is_canceled == 0, ) %>%
group_by(hotel, arrival_date_year, market_segment) %>% summarise(bookings = n())
ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "City Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Booking Channel") +
ggtitle("City Hotel Bookings through year trend") +
facet_wrap(~arrival_date_year)
ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "Resort Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Booking Channel") +
ggtitle("Resort Hotel Bookings through year trend") +
facet_wrap(~arrival_date_year)
Over the years the online travel agencies have gained more popularity which is reflected in our data as the number of bookings increase on this channel.
Based on the above data exploration we now check our data for trends to validate our proposed hypothesis to devise strategies to hotel bookings.
The bookings made through travel agencies tend to not get cancelled
In order to test the validity of the first hypothesis proposed, we will filter the bookings made through travel agencies from the agency_type and plot these booking against the cancellation flag is_canceled. This plot will help us accept or reject the hypothesis.
People with more family members tend to make bookings with travel agencies
For the second hypothesis we will look at the number of bookings with family members > 2 and see if these bookings are majorly made through an agency or no using the agency_type variable. This plot will help us accept or reject the hypothesis.
The bookings made through companies tend to get cancelled more
For the third hypothesis we will look at the bookings made through companies using company variable and plot the cancelled vs not cancelled booking against the companies, this will help us determine the validity of our hypothesis.