| Members : Soumi Giri, Arvind Kurumbal Radhakrishnan, Samreen Pathan, Saksham Bhutani, Divyanshu Kumar |
Tourism has always being a very lucrative industry worth millions of dollars in US alone. Due to the pandemic in 2020 the tourism industry was one of the worst impacted sectors. But as the world is moving towards a new normal we are taking a look at what strategies can be devised to improve booking rates for hotels as well as improve the hotel revenues.
To infer the different marketing and tactical strategies to improve hotel bookings we are taking a look at the hotels data provided to us. The data gives us details of hotel bookings from 1st of July of 2015 to the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. We have 2 hotel types from the same chain in our data Resort Hotel and City Hotel for which we have details regarding bookings like arrival dates, family members, type of room, booking medium etc. Both the hotels are located in Portugal, The Resort Hotel in resort region in Algarve and the City Hotel in Lisbon. Using the data we are trying to identify actionable strategies which would help us increase the booking rates of the hotels while also trying to identify the customer preferences.
We have a host of information at our disposal including arrival date information, duration of stay, customer demographics (number of adults/children), booking channel information (travel agents/tour operators details), customer historic behavior (booking changes/cancellations) etc. These features will be analyzed against number of bookings/Average Daily Rate/number of cancellations. This data exploration exercise should help:
Following were the steps taken:
This analysis will give the management team of the Portugal based hotel chain, a good understanding of bookings and cancellation trends over time and across other variables like customer demographics and services provided by the hotel. The insights can also be considered as a generalized trend for the industry across similar geographical regions. In addition to gaining an understanding of prevalent trends, this analysis can also be the foundation for design of marketing and operational strategies.
The following packages are loaded to conduct our analysis
library(tidyverse)
library(ggplot2)
library(dplyr)
library(knitr)
library(DT)
library(plotly)
library(viridis)
library(wordcloud)
library(RColorBrewer)
library(forcats)
The following packages are used for our analysis:
Data used for our analysis can be found here.
The data format is mixed (raw and processed).The data is comprised of two types of hotels i.e. Resort Hotel (40,060 obs) located in Algarve and City Hotel (79,330 obs) located in Lisbon. The data was collected between July 1st, 2015 and August 31st, 2017. Since the data is real, hotel name and any customer identifiers are not included.The data has 32 variables which can be classified into the following categories:
Time Based Features - Gives us when a booking was made, when the booking party arrived and how long the stay was. ArrivalDateDayOfMonth, ArrivalDateMonth, ArrivalDateWeekNumber, ArrivalDateYear, LeadTime etc
Customer Demography Features - Characteristics of the customers who had made the booking. Adults, Babies, Children, Country etc
Booking Profile - These columns provide us a specific characteristics of a customer’s booking. AssignedRoomType, BookingChanges, DaysInWaitingList, IsRepeatedGuest etc
Market Channels - Different channels through which the reservation was made. Agent, Company, DistributionChannel, MarketSegment.
Hotel Services - Different additional services that the hotel provides. meal, required_car_parking_spaces, total_of_special_requests.
The data was collected for educational purposes and is focusing on revenue management. Data was obtained directly from the hotels’ PMS databases’ servers.The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.
The objective of the data is to give a better understanding of this market as multiple things can be explored by this data like customer segmentation, seasonality, cancellation prediction, etc.
A detailed data dictionary view - Link.
hotels <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
In our hotels data, we have 13 character variables, 18 numeric variables and 1 date variable.
str(hotels)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Below we observe the summary statistics for all ournumerics columns.
num_cols <- unlist(lapply(hotels, is.numeric))
summary(hotels[, num_cols])
## is_canceled lead_time arrival_date_year arrival_date_week_number
## Min. :0.0000 Min. : 0 Min. :2015 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016 1st Qu.:16.00
## Median :0.0000 Median : 69 Median :2016 Median :28.00
## Mean :0.3704 Mean :104 Mean :2016 Mean :27.17
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017 3rd Qu.:38.00
## Max. :1.0000 Max. :737 Max. :2017 Max. :53.00
##
## arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
## Min. : 1.0 Min. : 0.0000 Min. : 0.0
## 1st Qu.: 8.0 1st Qu.: 0.0000 1st Qu.: 1.0
## Median :16.0 Median : 1.0000 Median : 2.0
## Mean :15.8 Mean : 0.9276 Mean : 2.5
## 3rd Qu.:23.0 3rd Qu.: 2.0000 3rd Qu.: 3.0
## Max. :31.0 Max. :19.0000 Max. :50.0
##
## adults children babies is_repeated_guest
## Min. : 0.000 Min. : 0.0000 Min. : 0.000000 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000 1st Qu.:0.00000
## Median : 2.000 Median : 0.0000 Median : 0.000000 Median :0.00000
## Mean : 1.856 Mean : 0.1039 Mean : 0.007949 Mean :0.03191
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000 3rd Qu.:0.00000
## Max. :55.000 Max. :10.0000 Max. :10.000000 Max. :1.00000
## NA's :4
## previous_cancellations previous_bookings_not_canceled booking_changes
## Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.0000
## Mean : 0.08712 Mean : 0.1371 Mean : 0.2211
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000 Max. :21.0000
##
## days_in_waiting_list adr required_car_parking_spaces
## Min. : 0.000 Min. : -6.38 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.: 69.29 1st Qu.:0.00000
## Median : 0.000 Median : 94.58 Median :0.00000
## Mean : 2.321 Mean : 101.83 Mean :0.06252
## 3rd Qu.: 0.000 3rd Qu.: 126.00 3rd Qu.:0.00000
## Max. :391.000 Max. :5400.00 Max. :8.00000
##
## total_of_special_requests
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.5714
## 3rd Qu.:1.0000
## Max. :5.0000
##
Days in waiting list
We observe that 96% bookings were confirmed within a day.
Booking Changes
We observe that for Booking Changes, 95% of the values lie between 0 and 2, 99% between 0 and 4. There are only 267 (0.2%) such cases where there are more than 4 booking changes.
Lead Time
We observe that 3148 bookings in our hotels data has Lead Time of more than a year. From the below data we also see that Resort Hotel has lower average lead time than City Hotel.
hist(hotels$days_in_waiting_list,
main="Distribution of days in waiting list",
xlab = "Days in waiting list",
col = "#00BFC4")
hist(hotels$booking_changes,
main="Distribution of Booking changes",
xlab = "Booking changes",
ylim=c(0,80000),
col = "#00BFC4")
hist(hotels$lead_time,
main="Distribution of Lead time",
xlab = "Lead time",
col = "#00BFC4")
Previous cancellations
We observe that for Previous cancellations, 99% of the values lie between 0 and 1. There are only 317 such cases where there are more than 2 previous cancellations.
Previous bookings not canceled
We observe that for Previous bookings not canceled, 99% of the values lie between 0 and 7. There are only 552 such cases where there are more than 7 non-cancellations.
#Understand variable and visually inspect
hist(hotels$previous_cancellations,
main="Distribution of Previous Cancellation",
xlab = "Previous Cancellations",
col = "#00BFC4")
hist(hotels$previous_bookings_not_canceled,
main="Distribution of Non Cancelled Bookings",
xlab = "Previous non cancelled bookings",
col = "#00BFC4")
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
As mentioned earlier for the agent and company columns, the string “NULL” indicates that the booking wasn’t made through an agent/company. Below are the summary numbers for such bookings.
length(hotels$agent[hotels$agent == "NULL"])
## [1] 16340
length(hotels$company[hotels$company == "NULL"])
## [1] 112593
Agent column has 16340 (~13%) NULLs and Company has 112593 (~94%) NULLs. This indicates that these bookings were not made via any agent or company.
Rest of the columns do not have any NULLs or missing values.
We check for outliers for the following numeric variables
ADR
We observe that for ADR, all values lie within 550 except one which is way greater, 5400. We are going with an assumption that this is a data entry issue and treating it as an outlier, replacing it by 0.
boxplot(hotels$adr)
After treating the outlier the distribution looks like:
hotels$adr[hotels$adr=="5400"] <- 0
hist(hotels$adr,
main="Distribution of ADR",
xlab = "ADR",
col = "#00BFC4", breaks = 50)
datatable(hotels[1:100,], caption = 'Table 1: Clean and tidy data.')
From the 32 features present in our data we can broadly classify the features in the following categories:
We will identify insights and strategies on these 4 categories to understand the factors that influence a hotel booking and cancellation.
We have information regarding the lead time between booking and arrival, arrival date, month, year, duration of stay in the data. Leveraging these, the below hypothesis will be checked:
Relation between lead time and cancellation, i.e. are bookings that were made a long time back canceled more often or are recently made bookings canceled more often ?
Relation between lead time and arrival month, i.e. for which months do guests book a long time in advance ?
Which months/weeks have more bookings ?
Which days of each month have more bookings ? Are there certain days in certain months where bookings spike, for example the Dec 25 - Jan 1st week? This combined with the previous analysis can be used for dynamic pricing.
On initial exploration of number of bookings across years, we see a big jump between 2015 and 2016 (almost 30%), the jump between 2016 and 2017 is not as much. Please note that here we exclude the cancellations and consider the number of bookings per month to account for the availability of partial data across years. This might not the right approximation as we do not account for seasonality here, for instance maybe hotel bookings might spike during the holiday season and the 2017 data is only till the month of August. This is something we will explore while carrying out the detailed EDAs.
bookings_by_year <-
hotels %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year) %>%
summarise(bookings_per_month = n()/n_distinct(arrival_date_month))
ggplot(bookings_by_year,aes(x=arrival_date_year,y=bookings_per_month)) +
geom_bar(position = "dodge", stat = "identity", fill = "#88ddbb") +
geom_text(aes(label=paste(round(bookings_per_month,1)))) +
xlab("Arrival Year") +
ylab("Average Bookings per Month") +
ggtitle("Bookings per month across years")
Relation between lead time and cancellation, i.e. are bookings that were made a long time back canceled more often or are recently made bookings canceled more often ?
To test the relationship between lead time and cancellation we use the below plot. We observe that a lot of cancellations are in the first few lead time buckets. But this can be misleading because a lot of the bookings are also in the first few lead time buckets. To get a better understanding we plot % cancellations by lead time bins and here we get the insight that cancellation rates increase as lead time increases. Therefore, people who have made the booking a few days before checking in are less likely to cancel.
hotels %>%
filter(is_canceled == 1) %>%
ggplot(aes(lead_time)) +
geom_histogram(col="black",fill="#00BFC4") +
labs(title="Cancelations by Lead Time", x="Lead Time", y="Count")
hotels %>%
ggplot(aes(lead_time)) +
geom_histogram(col="black",fill="#00BFC4") +
labs(title="Bookings by Lead Time", x="Lead Time", y="Count")
hotels %>%
mutate(lead_time_bucket = case_when(lead_time >= 0 & lead_time <= 200 ~ "0-200",
lead_time > 200 & lead_time <= 400 ~ "200-400",
lead_time > 400 & lead_time <= 600 ~ "400-600",
lead_time > 600 ~ ">600"),
lead_time_bucket = fct_relevel(lead_time_bucket,
"0-200","200-400","400-600",">600")) %>%
group_by(lead_time_bucket) %>%
summarise(cancellation_percent = sum(is_canceled == 1)/n()) %>%
ggplot(aes(x=lead_time_bucket,y=cancellation_percent*100)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
xlab("Lead Time") +
ylab("% Cancellations") +
ggtitle("Lead Time vs % Cancellations")
Relation between lead time and arrival month, i.e. for which months do guests book a long time in advance ?
For bookings which weren’t canceled, the highest average lead time is for bookings where the arrival month is July and June.
leadtime_by_month <-
hotels %>%
filter(is_canceled == 0) %>%
mutate(arrival_date_month_short = str_sub(arrival_date_month,1,3),
arrival_date_month_short = fct_relevel(arrival_date_month_short,
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec")) %>%
group_by(arrival_date_month_short) %>%
summarise(average_lead_time = mean(lead_time))
ggplot(leadtime_by_month,aes(x=arrival_date_month_short,y=average_lead_time)) +
geom_bar(position = "dodge", stat = "identity", fill = "#00BFC4") +
xlab("Arrival Month") +
ylab("Average Lead Time") +
ggtitle("Lead time by Month of Arrival")
Which months/weeks have more bookings ?
To check the number of bookings across weeks we plot the below line chart. We can infer that the number of bookings peak around week 20-25 and then again between 40-45. As a follow up we will replicate the same analysis separately for resort hotel and city hotel.
bookings_by_week <-
hotels %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year,arrival_date_week_number) %>%
summarise(bookings_per_week = n())
ggplot(bookings_by_week,aes(x=arrival_date_week_number,y=bookings_per_week,
group=arrival_date_year)) +
geom_line(aes(color=as.character(arrival_date_year))) +
ggtitle("Booking trends across weeks")+
xlab("Arrival Week Number") +
ylab("Bookings Count") +
guides(color=guide_legend(title="Arrival Year"))
The trends are slightly different when we filter for resort hotels. When we look at 2017 alone, the peak occurred around week 10. For 2016, the 2 peaks are seen around week 20 and 40, which is consistent with the trend observed in the previous chart.
bookings_by_week <-
hotels %>%
filter(is_canceled == 0 & hotel == 'Resort Hotel') %>%
group_by(arrival_date_year,arrival_date_week_number) %>%
summarise(bookings_per_week = n())
ggplot(bookings_by_week,aes(x=arrival_date_week_number,y=bookings_per_week,
group=arrival_date_year)) +
geom_line(aes(color=as.character(arrival_date_year))) +
ggtitle("Resort hotels - Booking trends across weeks") +
xlab("Arrival Week Number") +
ylab("Bookings Count") +
guides(color=guide_legend(title="Arrival Year"))
City hotel trends are consistent with what we saw in the overall weekly trends, i.e. peaks around weeks 20 and 40.
bookings_by_week <-
hotels %>%
filter(is_canceled == 0 & hotel == 'City Hotel') %>%
group_by(arrival_date_year,arrival_date_week_number) %>%
summarise(bookings_per_week = n())
ggplot(bookings_by_week,aes(x=arrival_date_week_number,y=bookings_per_week,
group=arrival_date_year)) +
geom_line(aes(color=as.character(arrival_date_year))) +
ggtitle("City hotels - Booking trends across weeks") +
xlab("Arrival Week Number") +
ylab("Bookings Count") +
guides(color=guide_legend(title="Arrival Year"))
Which days of each month have more bookings ? Are there certain days in certain months where bookings spike, for example the Dec 25 - Jan 1st week? This combined with the previous analysis can be used for dynamic pricing.
Below is a heatmap of number of bookings by month vs day. For ease of interpretation we have created a diverging color palette. Below are a few insights:
bookings_by_month_day <-
hotels %>%
filter(is_canceled == 0 & arrival_date_year == 2016) %>%
mutate(arrival_date_month = fct_relevel(arrival_date_month,
"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December")) %>%
group_by(arrival_date_month,arrival_date_day_of_month) %>%
summarise(bookings_per_day = n())
ggplot(bookings_by_month_day,aes(x=arrival_date_day_of_month,y=arrival_date_month,fill=bookings_per_day)) +
geom_tile() +
ggtitle("Heatmap - 2016 Day of Month vs Month") +
scale_fill_distiller(palette = 'PiYG') +
xlab("Arrival Date") +
ylab("Arrival Month")
From the data provided we see there are few additional services that the customer can opt during booking their stay in the hotels. These services are - meal, car parking and other special request. From the hotel services that are present in the data few hypothesis we propose are:
To check the different hypothesis regarding hotel services we will check the booking and cancellation count for Resort Hotel and City Hotel against the different services provided. We will create an additional field called services_offered in our data set which will indicate if a hotel is providing any additional services to the customer.
hotels$services_offered <- ifelse((hotels$meal != "Undefined" | hotels$required_car_parking_spaces > 0 | hotels$total_of_special_requests > 0), "Yes", "No")
On initial exploration we observe that both Resort Hotel and City Hotel offer 1 or more services and higher number customers tend to opt for these services. We can see the distribution as below
hotel_services <- hotels %>% mutate(canceled = ifelse(is_canceled == 1, "yes", "no")) %>% group_by(hotel, services_offered, canceled) %>% summarise(bookings = n())
ggplot(data = hotel_services, aes(fill = canceled, x = services_offered, y = bookings)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Bookings and Cancellation by Hotel type and service offered") +
facet_wrap(~hotel) +
geom_text(aes(label=bookings), position=position_dodge(width=0.9), vjust=-0.25)
We can further check which hotels provide which services
service_table <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", "no", "yes"),
car_parking = ifelse(required_car_parking_spaces > 0, "yes", "no"),
special_request = ifelse(total_of_special_requests > 0, "yes", "no")) %>%
select(hotel, meal_service, car_parking, special_request) %>%
filter(meal_service == "yes", car_parking == "yes", special_request == "yes") %>%
distinct(hotel, meal_service, car_parking, special_request, .keep_all = TRUE)
kable(service_table, caption = "Service offering by hotel type")
| hotel | meal_service | car_parking | special_request |
|---|---|---|---|
| Resort Hotel | yes | yes | yes |
| City Hotel | yes | yes | yes |
From the above table we see both Resort Hotel and City Hotel provide all 3 types of additional services. Now that we have checked our data we check our stated hypothesis to derive actionable strategies.
After the initial exploration of the services data we now check the validity of the five proposed hypothesis
Customer opting for 1 or more services are less likely to cancel their reservations
We check the percent of customers who cancelled after opting for services
service_count <- hotels %>% mutate(meal_service = ifelse(meal == "Undefined", 0, 1),
car_parking = ifelse(required_car_parking_spaces > 0, 1, 0),
special_request = ifelse(total_of_special_requests > 0, 1, 0),
total_service = meal_service + car_parking + special_request) %>%
select(hotel, arrival_date_year, total_service, is_canceled) %>% group_by(hotel, arrival_date_year, total_service, is_canceled) %>%
summarise(bookings = n())
service_total <- service_count %>% group_by(hotel, arrival_date_year, total_service) %>% summarise(total = sum(bookings))
service_pctcancelled <- left_join(service_count, service_total, by = c("hotel", "arrival_date_year", "total_service")) %>% filter(is_canceled == 1) %>%
mutate(pct = (bookings/total) * 100) %>% select(hotel, arrival_date_year, total_service, pct)
ggplot(data = service_pctcancelled, aes(fill = as.character(arrival_date_year), x = total_service, y = pct)) +
geom_bar(position = "dodge", stat = "identity") +
ggtitle("% Cancellation by services opted") +
facet_wrap(~hotel) +
xlab("Number of services opted") +
ylab("% bookings cancelled") +
theme(legend.title = element_blank(), text = element_text(size = 10)) +
geom_text(aes(label=paste(round(pct,1), "%")), size = 2.5, position=position_dodge(width=0.9), vjust=-0.3)
From the above plot we can see all City Hotel booking have opted for 1 or more services. For City Hotel we see people who have opted for 2 services are less likely to cancel their reservation as the cancellation rate drops from 56.7% for people opting 1 service to 26.9% for people opting for 2 services in 2017.
For Resort Hotel we see the cancellation rate drops from 39.6% for people opting for 1 service to 28.3% for people opting for 2 services in 2017. In both case of Resort Hotel and City Hotel 0 cancellations were present for people opting for all 3 services. From the trends observed, our hypothesis hold true. We should promote more additional services to customers to make their booking more attractive.
Bookings with higher group members and families are more likely to opt for services
hotels %>% mutate(family_size = adults + children + babies,
meal_service = ifelse(meal == "Undefined", 0, 1),
total_service = meal_service + required_car_parking_spaces + total_of_special_requests) %>% filter(is.na(family_size) == FALSE, is_canceled == 0) %>%
group_by(hotel, family_size) %>% summarise(bookings = n()) %>%
ggplot(aes(x = reorder(family_size, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
facet_wrap(~hotel)+
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Family Size") +
ggtitle("Fig a: Total Bookings by Family Size")
hotels %>% mutate(family_size = adults + children + babies,
meal_service = ifelse(meal == "Undefined", 0, 1),
total_service = meal_service + required_car_parking_spaces + total_of_special_requests) %>% filter(is.na(family_size) == FALSE, is_canceled == 0) %>%
group_by(hotel, family_size) %>% summarise(avg_services = mean(total_service, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(family_size, family_size), y = avg_services)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
facet_wrap(~hotel)+
coord_flip() +
scale_y_continuous(name="Average Services") +
scale_x_discrete(name="Family Size") +
ggtitle("Fig b: Average Service Opted by Family Size")
From the above chart we notice that majority of the bookings are done family size of 2 for both resort hotel and city hotel. From the Fig b we note families of size 3 are more likely to opt for more services than other groups. To improve the booking rates we can have deals and packages for services for families with 3 or more.
Families with kids are more likely to opt for greater service during their stay
data2 <- hotels %>% mutate(tot_children = ifelse(children + babies > 0, "yes", "no"),
family_size = adults + children + babies,
meal_service = ifelse(meal == "Undefined", 0, 1),
total_service = meal_service + required_car_parking_spaces + total_of_special_requests,
stay_duration = stays_in_week_nights + stays_in_weekend_nights) %>%
filter(is_canceled == 0, total_service > 0, family_size > 0) %>% group_by(hotel, family_size, tot_children) %>% summarise(booking = n())
ggplot(data = data2[data2$hotel == "City Hotel",], aes(x = reorder(as.character(family_size), family_size), y = booking, fill = tot_children)) +
geom_col(position = "fill") +
scale_y_continuous(name="Percentage bookings") +
xlab("Family Size") +
ggtitle("Fig a: Services opted by family with kids vs no kids in City hotel") +
labs(fill = "Have Children")
ggplot(data = data2[data2$hotel == "Resort Hotel",], aes(x = reorder(as.character(family_size), family_size), y = booking, fill = tot_children)) +
geom_col(position = "fill") +
scale_y_continuous(name="Percentage bookings") +
xlab("Family Size") +
ggtitle("Fig b: Services opted by family with kids vs no kids in Resort hotel")+
labs(fill = "Have Children")
From the above stacked barplot in fig a we observe that in City hotel family sizes of more than 3 tend to have children, but for family sizes 3 or less people without children are more likely to opt for services. Whereas in Resort hotel the maximum group size is 5 and groups of 3 or more tends to have kids with them and they opt for more services. To further understand the number of services opted by families with kids to explore fig c and fig d.
fam_kids <- hotels %>% mutate(tot_children = ifelse(children + babies > 0, "yes", "no"),
family_size = adults + children + babies,
meal_service = ifelse(meal == "Undefined", 0, 1),
total_service = meal_service + required_car_parking_spaces + total_of_special_requests,
stay_duration = stays_in_week_nights + stays_in_weekend_nights) %>%
filter(is_canceled == 0) %>% group_by(hotel, total_service, family_size, tot_children) %>% summarise(booking = n()) %>% filter(tot_children == "yes")
ggplot(data = fam_kids[fam_kids$hotel == "City Hotel",], aes(x = reorder(as.character(family_size), family_size), y = booking, fill = as.character(total_service))) +
geom_col(position = "fill") +
scale_y_continuous(name="Percentage bookings") +
xlab("Family Size") +
ggtitle("Fig c: Services opted by family with Kids in City hotel") +
labs(fill = "Number of Services")
ggplot(data = fam_kids[fam_kids$hotel == "Resort Hotel",], aes(x = reorder(as.character(family_size), family_size), y = booking, fill = as.character(total_service))) +
geom_col(position = "fill") +
scale_y_continuous(name="Percentage bookings") +
xlab("Family Size") +
ggtitle("Fig d: Services opted by family with Kids in Resort hotel") +
labs(fill = "Number of Services")
From fig c we notice most families with kids tend to opt for 1 or 2 services, but family sizes of 3 tend to opt for higher number of services than 2 in City hotels. From fig d a similar trend is being followed in resort hotel.
Customers of Resort Hotel are more likely to opt for special service requests than City Hotel Customers
other_services <- hotels %>% filter(is_canceled == 0) %>% group_by(hotel) %>% summarise(avg_special_service = mean(total_of_special_requests))
ggplot(data = other_services, aes(x = hotel, y = avg_special_service)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
ggtitle("Average specials services by Hotel type") +
xlab("") +
ylab("Average Special Services") +
geom_text(aes(label= round(avg_special_service,2)), size = 2.5, position=position_dodge(width=0.9), vjust=-0.3)
From the above plot we notice our proposed hypothesis doesn’t hold true. On contrary people are opting for more special services in city hotel than in resort hotel. In our hotel data we don’t have much details regarding the type of special services request by customers. Based on the observation we can propose to provide deals on special services to City hotel customer to boost booking rate.
During the holiday seasons number of special requests increases for both resort and city hotels
services_holiday <- hotels %>% filter(is_canceled == 0) %>%
mutate(month_date = lubridate::mdy(paste(arrival_date_month, "01",arrival_date_year, sep = "/")),
month = match(arrival_date_month, month.name)) %>%
group_by(hotel, arrival_date_year, month) %>%
summarise(special_req = sum(total_of_special_requests),
avg_special_req = mean(total_of_special_requests),
bookings = n())
ggplot(data = services_holiday, aes(x = month, y = special_req, group = hotel)) +
geom_line(aes(color = hotel)) +
facet_wrap(~arrival_date_year) +
scale_x_discrete("Months", limits = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")) +
ggtitle("Total specials services by Month") +
xlab("Month-Year") +
ylab("Total Special Services")
ggplot(data = services_holiday, aes(x = month, y = bookings, group = hotel)) +
geom_line(aes(color = hotel)) +
facet_wrap(~arrival_date_year) +
scale_x_discrete("Months", limits = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")) +
ggtitle("Total bookings by Month") +
xlab("Month-Year") +
ylab("Total Bookings")
From the above line chart we notice the number of special request actually spikes from the month of July and August and then we see a smaller spike in the holiday season. This can be corroborated with the number of bookings made which also spikes during these months every year as can be seen from the above 2 plots.
To improve our booking rates we can provide special offers on such services during the peak booking months.
Bed and Breakfast is the most popular meal type opted by customers in both types of Hotel
meal_type <- hotels %>% filter(is_canceled == 0) %>% group_by(hotel, meal) %>% summarise(bookings = n())
total_booking <- hotels %>% filter(is_canceled == 0) %>% group_by(hotel) %>% summarise(bookings_tot = n())
meal_type_pct <- left_join(meal_type, total_booking, by = "hotel")
ggplot(data = meal_type_pct, aes(x= reorder(meal, bookings), y = (bookings/bookings_tot))) +
geom_bar(stat = "identity", fill = "#00BFC4") +
facet_wrap(~hotel)+
coord_flip() +
scale_y_continuous(name="Number of bookings") +
scale_x_discrete(name="Meal type") +
ggtitle("Popular meal type by hotels")
From the above plots we note the most popular meal type for both city hotel and resort hotel is bed and breakfast. Another interesting observation from the above chart is that for City hotel we note that about 21% of the customers opt for no meal services whereas in resort hotel we note that most customers opt either for bed and breakfast package or half board package which includes breakfast and one other meal – usually dinner.
From this we can strategize to provide more deals on meal packages to resort customers to boost booking rates.
To ensure anonymity in data the PMS system has removed any customer identifiers. We have few demographic data in our hotels data table - number of members, country of origin, type of deposit paid, etc. Based on these characteristics we propose the following key question which will help us identify factors influencing bookings and cancellations:
One of the key inferences that we get from the data is that majority of tourist who book the hotel stay are from Portugal which is the home country followed by tourists from Great Britain which can be seen below
hotels_country <- hotels %>% select(hotel, country, arrival_date_year, is_canceled) %>% filter(is_canceled == 0) %>%
group_by(country) %>% summarise(bookings = n()) %>% arrange(desc(bookings))
wordcloud(words = hotels_country$country, freq = hotels_country$bookings, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
To check the hypothesis related to family we need to come up with a logic which would qualify an entry as family. For that, an entry with a non-negative count of either babies or children and a non-negative count of an adult would qualify as a family and is named is_family.A new variable named booking_duration is created that will indicate the number of days a customer stayed. We’ll sum stays_in_weekend_nights and stays_in_week_nights to compute this.
is_family <- ((hotels$babies> 0 | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration
Making a new variable that includes countries that had the most visitors. This has top 6 countries named and rest of the countries as others.
hotels$new_country <- hotels$country
hotels$new_country[hotels$new_country != "PRT" & hotels$new_country != "GBR"
& hotels$new_country != "FRA" & hotels$new_country != "ESP"
& hotels$new_country != "DEU" & hotels$new_country != "ITA"] <- "Others"
hotels$new_country[hotels$new_country == "PRT"] <- "Portugal"
hotels$new_country[hotels$new_country == "GBR"] <- "Great Britain"
hotels$new_country[hotels$new_country == "FRA"] <- "France"
hotels$new_country[hotels$new_country == "ESP"] <- "Spain"
hotels$new_country[hotels$new_country == "DEU"] <- "Germany"
hotels$new_country[hotels$new_country == "ITA"] <- "Italy"
The customers who are part of a group or group with children tend to have a longer stay
The aim is to identify whether the the visitors will stay longer as the no. of visitors in a booking would increase. There are two plots, one for just group size and other for groups which have children i.e. visitors who are part of a family.
hotels %>% mutate(group_size_n = adults+children+babies,total_stay = stays_in_weekend_nights+stays_in_week_nights)%>%
group_by(group_size_n)%>%
summarise(average_booking = mean(total_stay)) %>%
ggplot(aes(group_size_n,average_booking)) +
geom_point(color = "#00BFC4",size = 3) +
labs(title="Booking Duration by group size", x="Group Size", y="Booking Duration")+
theme_minimal()
## Warning: Removed 1 rows containing missing values (geom_point).
For groups with children:
hotels %>% mutate(group_size_n = adults+children+babies,total_stay = stays_in_weekend_nights+stays_in_week_nights)%>%
filter(is_family == TRUE) %>%
group_by(group_size_n)%>%
summarise(average_booking = mean(total_stay)) %>%
ggplot(aes(group_size_n,average_booking )) +
geom_point(color = "#00BFC4",size= 3) +
labs(title="Booking Duration by room size", x="Group Size", y="Booking Duration")+
theme_minimal()
both the plot indicate that as the group size increases, the average booking duration also increase, but a certain group size it starts decreasing. But since there are very few entries when it comes to larger group size, that point does have much weight. Overall our hypothesis stands validated as per the plot.
Groups or groups with children tend to cancel the bookings more as the count increases
The following plots will try and prove that as the count for group member would increase the rate of cancellation would also increase.
pct_count <- hotels %>% mutate( group_size_n = adults + children + babies)%>%
group_by(group_size_n)%>%
summarise(total_count = n(),count_for_cancelled = sum(is_canceled == 1),cancelled_pct = (count_for_cancelled/total_count) * 100)
ggplot(pct_count,aes(group_size_n,cancelled_pct) )+
geom_point(color = "#00BFC4",size = 3)+
ggtitle("Cancellation Percentage by group size") +
xlab("Group size") +
ylab("Cancellation Percentage")+
theme_linedraw()
## Warning: Removed 1 rows containing missing values (geom_point).
For groups with children
hotels %>% filter(is_family ==TRUE)%>%
mutate( group_size_n = adults + children + babies)%>%
group_by(group_size_n)%>%
summarise(total_count = n(),count_for_cancelled = sum(is_canceled == 1),cancelled_pct = (count_for_cancelled/total_count) * 100) %>%
ggplot(aes(group_size_n,cancelled_pct))+
geom_point(color = "#00BFC4",size = 3)+
ggtitle("Cancellation Percentage by group size") +
xlab("Group size") +
ylab("Cancellation Percentage")+
theme_linedraw()
In both the cases it can be observed that as the no. of visitors increase, the cancellation rate also increases. In the plot for groups with children we see for group size equal to 10, we don’t any calculation, it is because there is only one entry for group size equal to 10. Other than that the both plot validate our hypothesis.
Booking would peak around July-August(mid - year) for visitors from different countries
The hypothesis is that most of the people from different countries would visit around July and august i.e. is around the peak tourism season as summers are usually holiday season for kids.
leadtime_by_month <-
hotels %>%
filter(is_canceled == 0) %>%
mutate(arrival_date_month_short = str_sub(arrival_date_month,1,3),
arrival_date_month_short = fct_relevel(arrival_date_month_short,
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec")) %>%
group_by(new_country,arrival_date_month_short) %>%
summarise(booking_count = n())
## `summarise()` has grouped output by 'new_country'. You can override using the `.groups` argument.
ggplot(leadtime_by_month,aes(x=arrival_date_month_short,y=booking_count,group = new_country)) +
geom_line(aes(color=as.character(new_country))) +
xlab("Arrival Month") +
ylab("Total Bookings") +
ggtitle("Booking trends by country") +
labs(color = "Country")
The peaks no. of visitors from most of the countries is around July and August. So our hypothesis stands validated as Summers and Early fall looks to be the most favorable season for visiting.
The booking profile of the customer in consonance with the other feature types can be used to derive holistic insights from the data. Following hypothesis will be tested:
We create a new variable that tracks the ability of the hotel to fulfil the booking by assigning the reserved room to its customer.
hotels$same_room <- (hotels$reserved_room_type == hotels$assigned_room_type)
We create new variables to track groups with children and booking duration of all customers.
is_family <- ((hotels$babies> 0 | hotels$children > 0) & hotels$adults > 0)
hotels$is_family <- is_family
booking_duration <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights
hotels$booking_duration <- booking_duration
is_deposit = ((hotels$deposit_type == "Non Refund" | hotels$deposit_type == "Refundable"))
hotels$is_deposit <- is_deposit
hotels$is_deposit[hotels$is_deposit == TRUE] <- "Deposited"
hotels$is_deposit[hotels$is_deposit == FALSE] <- "No Deposit"
We can now visually inspect the data to test the validity of our hypothesis.
Lead time can be an indicator of the number of booking changes
To test and visualize the relationship between the booking changes and the lead time we use the following plot. Here we can see that there are a greater number of changes in bookings that were booked closer to the arrival date.
total_changes<-sum(hotels$booking_changes)
hotels %>%
filter(is_canceled == 0) %>%
mutate(
lead_time_bucket = case_when(
lead_time >= 0 & lead_time <= 50 ~ "0-50",
lead_time > 50 & lead_time <= 100 ~ "50-100",
lead_time > 100 & lead_time <= 150 ~ "100-150",
lead_time > 150 & lead_time <= 200 ~ "150-200",
lead_time > 200 & lead_time <= 250 ~ "200-250",
lead_time > 250 & lead_time <= 300 ~ "250-300",
lead_time > 200 & lead_time <= 400 ~ "300-400",
lead_time > 400 & lead_time <= 600 ~ "400-600",
lead_time > 600 ~ ">600"
),
lead_time_bucket = fct_relevel(lead_time_bucket,
"0-50","50-100","100-150","150-200","200-250","250-300","300-400", "400-600", ">600")
) %>% group_by(lead_time_bucket) %>%
summarise(changes_percent = sum(booking_changes) / total_changes) %>%
ggplot(aes(x = lead_time_bucket, y = changes_percent * 100)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
xlab("Lead Time") +
ylab("% Booking Changes") +
ggtitle("Lead Time vs % Changes")
There is a significant relation between the room types and the adr
To test the validity of our hypothesis we will use the following plot:
library(hrbrthemes)
ggplot(data=hotels, aes(x=adr, group=reserved_room_type, fill=reserved_room_type)) +
geom_density() +
coord_cartesian(xlim = c(0,450),ylim = c(0,0.02)) +
scale_y_continuous(breaks=seq(0,0.020,0.01)) +
facet_wrap(~reserved_room_type,scale="free_y") +
ggtitle("ADR for all reserved room types") +
theme(legend.position="none",
panel.spacing = unit(0.25, "lines"),
strip.text.x = element_text(size = 8)) +
xlab("ADR") +
ylab("")
The plot above helps understand the distribution of ADR for each room. A peak to the right would indicate a higher ADR on average and a flatter spread would indicate that the ADR is spread out. from this particular plot we can infer that rooms types - A, B, E and, P have on average lower ADR than the other rooms that seem to have a much flatter and wider distribution. Interestingly, room type P has almost a bar which would indicate 0 to low ADR.
There is a relation between the distribution channel and ability of the hotel to provide the requested room
The distribution channel can affect the ability of a hotel to provide the same room to a customer. To test the validity of our hypothesis we will use the following plot:
hotels %>%
mutate(distribution_channel = fct_relevel(distribution_channel,
"Corporate","Direct","TA/TO","GDS","Undefined")) %>%
ggplot(aes(distribution_channel, fill = factor(same_room))) +
geom_bar(position = "fill") +
labs(fill = "Same room") +
xlab("Distribution channel") +
ylab("Proportion") +
ggtitle("Same room per Distribution channel")
The plot above does show a small trend i.e. GDS > TA/TO > Direct > Corporate in the ability to provide the same room. To further investigate this relationship we will see if any one hotel type is causing this trend.
hotels %>%
mutate(distribution_channel = fct_relevel(distribution_channel,
"Corporate","Direct","TA/TO","GDS","Undefined")) %>%
ggplot(aes(distribution_channel, fill = factor(same_room))) +
geom_bar(position = "fill") +
labs(fill = "Same room") +
xlab("Distribution channel") +
ylab("Proportion") +
ggtitle("Distribution channel for Both hotel types") +
facet_wrap(~hotel, nrow = 2)
The plot above shows that trend is independent of the hotel type as the same can be observed here.
Hotels in the city have greater traffic and are more susceptible to overbooking
The ability of a hotel to provide the reserved room to the customer is essential in customer satisfaction. However, there are times when a hotel is unable to provide the same. Here we want to visualize the relation between overbooking and the location or type of the hotel and the number of bookings. To test the validity of our hypothesis we will use the following plot:
ggplot(data = hotels, aes(hotel, fill = factor(same_room))) +
geom_bar(position = "dodge") +
labs(fill = "Same room") +
coord_flip() +
xlab("Hotel type") +
ylab("Count")+
ggtitle("Hotel types and overbooking")
Here we can see that City Hotels do in fact have greater foot traffic but contrary to our initial hypothesis we see that Resort Hotels, despite having less than half the number of bookings have the same if not more changes to the room type which can be associated to overbooking. Thus, the location and the type of the hotel do play a role in the ability of a hotel in providing the same room to a customer.
There is a significant relation between the market segment and the assigned room types
There may be a relation between market segment and the reserved room types, the reason for this relation may be intrinsic or extrinsic to the room type. To test the validity of our hypothesis we will use the following plot:
ggplot(data = hotels, aes(assigned_room_type, fill = factor(market_segment))) +
geom_bar(position = "fill") +
coord_polar() +
xlab("Assigned Room Type") +
ylab("Proportion")+
labs(fill = "Market Segment")
The plot above shows that only room types L and P are dominated by a single market segment and there is no relation for all other rooms to any single market segment.
There is a significant relation between the distribution channel and the reserved room types
There may be a relation between distribution channel and the reserved room types, the reason for this relation may be intrinsic or extrinsic to the room type. To test the validity of our hypothesis we will use the following plot:
ggplot(data = hotels, aes(reserved_room_type, fill = factor(distribution_channel))) +
geom_bar(position = "fill") +
coord_polar() +
xlab("Reserved Room Type") +
ylab("Proportion")+
labs(fill = "Distribution Channel")+
ggtitle("Room types and Distribution channel")
Rooms A, D, L and P are dominated by one distribution channel. Room L and P are for Direct and Corporate customers predominantly. This is in line with the plot above as room P was mostly complementary. In line with the above hypothesis we can conclude that there is no significant relation for all other rooms to any single market segment.
Certain room types can have a larger lead time
To test the validity of our hypothesis we will use the following plot:
ggplot(hotels, aes(lead_time,fill=reserved_room_type)) +
geom_histogram() +
labs(title="Reserved room by Lead time",
x="Lead time") +
facet_wrap(~reserved_room_type, scale="free_y") +
theme(legend.position="none")
The plot above, in consonance from the previous plots we can see room types L and P have the lowest lead time as they are either for only direct customer or complementary. We also see Rooms A and D are the most popular with lower lead times on average. Room B and C have a flatter spread here and when seen in consonance with their ADR distribution we can see that these rooms while relatively consistent on the ADR are on the lower side with higher lead times. Room L which caters to direct customers has the Lead time and ADR distribution.
People who do not pay deposit while booking tend to cancel the booking
To test the above hypothesis we look at the trend of cancellations across deposit type and check the trend.
hotels$is_canceled <- as.character(hotels$is_canceled)
deposit_canc <- ggplot(data=hotels, aes(x=factor(deposit_type), fill=is_canceled)) +
geom_bar(col = "black",stat="count", position=position_dodge()) +
labs(title = "Bookings by deposit type", x = "Deposit Type", y = "Bookings")
deposit_canc
This shows us the most number of booking which get cancelled are where people have made no deposit for the booking. But this also gives us another interesting finding that the cancellation rate across booking with non refund is higher as compared to not cancelled bookings.
This helps us deduce that, the rate of overall cancelled bookings (bookings cancelled/booking cancelled+not cancelled) is much higher in case of Non-refund deposit type as compared to the bookings where deposit is not paid.
Groups with children tend to pay deposits while booking
We plan to filter the number of bookings where members of Groups with children > 2 and see if these booking were made with deposit being paid while booking or not. This plot will help us accept or reject the hypothesis.
temp_t <- hotels %>%
filter(is_family == TRUE) %>%
select(is_family,is_deposit)
ggplot(temp_t,aes(x=is_deposit))+
geom_bar(col="black", fill="#00BFC4")+
ggtitle("Desposit status for booking done by family") +
xlab("Deposit Type") +
ylab("Deposit Booking count")
As per the plot, bookings made by the group members largely did not go for deposit.
Majority of the people tend to not pay deposits while booking
This is a simple but significant insight, usually in hotel industry it is observed that the count for deposits while bookings are way less compared to no deposits.Let’s see the following plot to confirm this.
ggplot(hotels,aes(is_deposit))+
geom_bar(col="black", fill="#00BFC4")+
ggtitle("Did the customers pay deposits?") +
xlab("Deposit Type") +
ylab("Deposit Booking count")
So,the plot confirms that usually customers do not pay deposits while booking.
The market channel metrics conveys the booking channel that was used by the customers for making the reservation. Our market channel metrics contain - Agent, Company, Distribution Channel and Market Segment. Based on our initial exploration we put forward the following hypothesis to gauge the impact of the booking channel on customer bookings:
On a overall level we see that customers prefer to make reservations through online travel agencies as we can see in the below plots for both city Hotel and Resort Hotel.
hotel_mkseg <- hotels %>% select(hotel, arrival_date_year, market_segment, is_canceled) %>% filter(is_canceled == 0, ) %>%
group_by(hotel, arrival_date_year, market_segment) %>% summarise(bookings = n())
ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "City Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Booking Channel") +
ggtitle("City Hotel Bookings through year trend") +
facet_wrap(~arrival_date_year)
ggplot(data = hotel_mkseg[hotel_mkseg$hotel == "Resort Hotel",], aes(x = reorder(market_segment, bookings), y = bookings)) +
geom_bar(stat = "identity", fill = "#00BFC4") +
coord_flip() +
scale_y_continuous(name="Total Bookings") +
scale_x_discrete(name="Booking Channel") +
ggtitle("Resort Hotel Bookings through year trend") +
facet_wrap(~arrival_date_year)
Over the years the online travel agencies have gained more popularity which is reflected in our data as the number of bookings increase on this channel.
Based on the above data exploration we now check our data for trends to validate our proposed hypothesis to devise strategies to hotel bookings.
People with more family members tend to make bookings with travel agencies
For the first hypothesis we will look at the number of bookings with family members > 2 and see if these bookings are majorly made through an agency or no using the agency_type variable. This plot will help us accept or reject the hypothesis.
hotels$members = hotels$adults + hotels$children + hotels$babies
hotels_not_cancelled <- hotels %>% filter(is_canceled==0)
hotels_not_cancelled <- hotels_not_cancelled %>%
mutate(flag = if_else(members > 2, "More people(>2)","Less people(<=2)"))
hotels_not_cancelled$flag <- as.character(hotels_not_cancelled$flag)
hotel_sub_mktg <- hotels_not_cancelled %>%
mutate(market_segment = fct_relevel(market_segment,"Direct","Online TA","Complementary","Offline TA/TO","Groups","Corporate","Aviation")) %>%
group_by(market_segment, flag) %>%
summarise(count = n()) %>%
mutate(perc = count/sum(count))
ggplot(hotel_sub_mktg, aes(x = factor(market_segment), y = perc*100, fill = factor(flag))) +
geom_bar(stat="identity", width = 0.5) +
labs(x = "Marketing segment", y = "Percentage booking", fill = "Group Type") +
ggtitle("% Bookings across marketing channels by family/group size") +
theme_minimal() +
coord_flip()
Here first we filtered for the bookings which were not cancelled. Among those, we created a flag where, if the members > 2 for a booking, flag is 1, 2 or less, flag is 0.
We see that booking with people more than 2 are done mostly direct bookings or through online travel agent. Which proves our hypothesis to be true.
The bookings made through travel agencies tend to not get cancelled
In order to test the validity of the second hypothesis proposed, we will filter the bookings made through travel agencies from the market_segment and plot these booking against the cancellation flag is_canceled. This plot will help us accept or reject the hypothesis.
hotels$is_canceled <- as.character(hotels$is_canceled)
hotels_sub <- hotels %>%
mutate(market_segment = fct_relevel(market_segment,"Undefined","Groups","Online TA","Offline TA/TO","Aviation","Corporate",
"Direct","Complementary"))%>%
group_by(market_segment, is_canceled) %>%
summarise(count = n()) %>%
mutate(perc = count/sum(count))
ggplot(hotels_sub, aes(x = factor(market_segment), y = perc*100, fill = factor(is_canceled))) +
geom_bar(stat="identity", width = 0.5) +
labs(x = "Marketing segment", y = "Percentage booking", fill = "Cancellation") +
ggtitle("% Cancellations across marketing channels ") +
theme_minimal() +
coord_flip()
Here we see that comparing the number of cancellations across different channels, maximum cancellations occur for bookings made by Groups(~61%) followed by Online Travel agents(~34%)
Below are a few recommendations based on our analysis:
The objective of the exploration was to understand the key drivers behind Resort and City hotel bookings in Portugal. From the detailed analysis we were able to identify few key drivers like - lead time, group size, service types, etc. Based on these influential factors, strategic and marketing recommendations were devised. These factor identification can be further leveraged to understand sales fluctuations, booking prediction and campaign design for hotels from this regions. Few of the broader insights can be leveraged across the hotel industry.