In 2019 the travel and tourism sector accounted for over 10% of the world’s GDP and 1 in 10 jobs across the globe making it one of the largest industries in the world. The industry grew 3.5% in 2019 and is expected to continue growing over the next 10 years [(World Travel and Tourism Council, 2019)] (https://wttc.org/Research/Economic-Impact). As the industry continues to expand, so does competition. Hotel firms are engaged in an arduous battle to cut costs and create more value for guests than competing firms. One of the most imminent barriers to boasting profits and outperforming rivals are cancellations. Booking cancellations are one of the biggest threats to the hospitality industry. To protect against the threat of cancellations, hotels utilize cancellation policies and overbooking tactics. However, if used incorrectly overbooking strategies can produce detrimental effects and undermine a firm’s market position. Hotels often overbook based on the assumption that a certain percentage of guests will cancel their reservations. Overbooking protects hotels from losses due to operating below capacity, however it also creates additional risk. If there are fewer cancellations than forecasted, hotels will be unable to provide rooms for guests which can result in the loss of loyal customers and brand tarnishment [(Antonio et al., 2017)] (https://ieeexplore.ieee.org/document/8260781). Determining the optimal rate of overbooking serves the best interests of both hotel firms and consumers by enabling firms to maximize profits while adequately satisfying consumer needs. Effective overbooking strategies are also critical to staying competitive in the industry. To navigate the growing competition in the travel and tourism industry, a major hotel corporation has requested our services to determine the optimal rate of overbooking in order to minimize risk and maximize profits.
To determine the optimal overbooking rate for our client we will analyze comprehensive hotel booking data collected from 2015-2017. We will evaluate numerous variables to determine their relationship with the cancellation rate variable. A variety of approaches and analytic techniques will be utilized to analyze and make sense of the data. Techniques include exploratory data analysis, analysis of the variance, and linear regression. Gaining insight on the relationship of different variables with the cancellation rate variable will enable us to advise our client on the factors that must be considered when determining the ideal overbooking rate. Overall this will enable our client to utilize the most efficient cancellation rate to maximize profits, minimize risk, and remain competitive in the industry.
Our data was obtained from Antonio, Almeida, and Nunes (2019)
setwd("C:/Users/messe/Documents/Masters/Fall 2020/Data Wrangling/Project");
data.df <- read.csv("C:/Users/messe/Documents/Masters/Fall 2020/Data Wrangling/Project/hotels.csv", stringsAsFactors = FALSE)
options(warn = -1)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RColorBrewer)
#install.packages("threadr")
library(threadr)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
##
## Attaching package: 'threadr'
## The following objects are masked from 'package:readr':
##
## parse_time, read_lines, type_convert
The data was originally collected for the role of research in various fields such as education in revenue management, machine learning, and data mining. The data is a culmination of two data sets - one on a resort hotel and the other on a city hotel. In total the data has 31 variables, with 119,390 observations. In this context, observations represent hotel bookings. The data was collected from July of 2015 to August 2017. The data appears to be relatively clean. However, different variables use inconsistent naming conventions to classify missing values. Missing values are classified as NA, undefined, and Null throughout the data. We will utilize data cleaning techniques to uniformly classify missing values and make a strategic decision on whether to omit, change, or keep missing values.
We will clean the data to create consistency and ensure it can be thoroughly and efficiently analyzed.
To clean the data we will: * Ensure variables names follow a proper, uniform naming convention + Snake case was selected as the designated naming convention + Source data was received in selected naming convention
data.df$meal <- replace(data.df$meal, data.df$meal=="Undefined", "none");
data.df$meal <- replace(data.df$meal, data.df$meal=="SC", "none")
data.df %>%
head(10) #Dataset with missing or null values 119,390 values
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## 7 27 1 0
## 8 27 1 0
## 9 27 1 0
## 10 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## 7 2 2 0 0 BB PRT Direct
## 8 2 2 0 0 FB PRT Direct
## 9 3 2 0 0 BB PRT Online TA
## 10 3 2 0 0 HB PRT Offline TA/TO
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## 7 Direct 0 0
## 8 Direct 0 0
## 9 TA/TO 0 0
## 10 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## 7 0 C C
## 8 0 C C
## 9 0 A A
## 10 0 D D
## booking_changes deposit_type agent company days_in_waiting_list
## 1 3 No Deposit NULL NULL 0
## 2 4 No Deposit NULL NULL 0
## 3 0 No Deposit NULL NULL 0
## 4 0 No Deposit 304 NULL 0
## 5 0 No Deposit 240 NULL 0
## 6 0 No Deposit 240 NULL 0
## 7 0 No Deposit NULL NULL 0
## 8 0 No Deposit 303 NULL 0
## 9 0 No Deposit 240 NULL 0
## 10 0 No Deposit 15 NULL 0
## customer_type adr required_car_parking_spaces total_of_special_requests
## 1 Transient 0.0 0 0
## 2 Transient 0.0 0 0
## 3 Transient 75.0 0 0
## 4 Transient 75.0 0 0
## 5 Transient 98.0 0 1
## 6 Transient 98.0 0 1
## 7 Transient 107.0 0 0
## 8 Transient 103.0 0 1
## 9 Transient 82.0 0 1
## 10 Transient 105.5 0 0
## reservation_status reservation_status_date
## 1 Check-Out 7/1/2015
## 2 Check-Out 7/1/2015
## 3 Check-Out 7/2/2015
## 4 Check-Out 7/2/2015
## 5 Check-Out 7/3/2015
## 6 Check-Out 7/3/2015
## 7 Check-Out 7/3/2015
## 8 Check-Out 7/3/2015
## 9 Canceled 5/6/2015
## 10 Canceled 4/22/2015
data.df <- na.omit(data.df) #New dataset with 119,386 values
data.df <- data.df[-c(31,32)]
The category “Undefined” in market_segment and distribution_channel were changed to “none”
distribution_channel was removed to avoid correlation between market_segment and distribution_channel in our analysis.
data.df <- data.df[-c(16)]
data.df <- data.df[-c(6)]
data.df <- data.df[-c(27)]
data.df$agent <- replace(data.df$agent, data.df$agent=="NULL","none");
data.df$company <- replace(data.df$company, data.df$company=="NULL","none")
data.df.one <- data.df
data.df.one$hotel <- as.factor(data.df$hotel);
data.df.one$arrival_date_year <- as.factor(data.df$arrival_date_year);
data.df.one$arrival_date_month <- as.factor(data.df$arrival_date_month);
data.df.one$meal <- as.factor(data.df$meal);
data.df.one$country <- as.factor(data.df$country);
data.df.one$market_segment <- as.factor(data.df$market_segment);
data.df.one$reserved_room_type <- as.factor(data.df$reserved_room_type);
data.df.one$assigned_room_type <- as.factor(data.df$assigned_room_type);
data.df.one$booking_changes <- as.factor(data.df$booking_changes);
data.df.one$deposit_type <- as.factor(data.df$deposit_type);
data.df.one$agent <- as.factor(data.df$agent);
data.df.one$company <- as.factor(data.df$company);
data.df.one$customer_type <- as.factor(data.df$customer_type)
# Add unique row ID
data.df.one$ID <- uuid_rowwise(data.df.one)
# Translate numeric variables into categorical data based on quantiles
data.df.one %>%
summary()
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79326 Min. :0.0000 Min. : 0 2015:21992
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 2016:56707
## Median :0.0000 Median : 69 2017:40687
## Mean :0.3704 Mean :104
## 3rd Qu.:1.0000 3rd Qu.:160
## Max. :1.0000 Max. :737
##
## arrival_date_month arrival_date_day_of_month stays_in_weekend_nights
## August :13873 Min. : 1.0 Min. : 0.0000
## July :12661 1st Qu.: 8.0 1st Qu.: 0.0000
## May :11791 Median :16.0 Median : 1.0000
## October:11160 Mean :15.8 Mean : 0.9276
## April :11089 3rd Qu.:23.0 3rd Qu.: 2.0000
## June :10939 Max. :31.0 Max. :19.0000
## (Other):47873
## stays_in_week_nights adults children babies
## Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. : 0.000000
## 1st Qu.: 1.0 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000
## Median : 2.0 Median : 2.000 Median : 0.0000 Median : 0.000000
## Mean : 2.5 Mean : 1.856 Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 3.0 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :50.0 Max. :55.000 Max. :10.0000 Max. :10.000000
##
## meal country market_segment is_repeated_guest
## BB :92306 PRT :48586 Aviation : 237 Min. :0.00000
## FB : 798 GBR :12129 Complementary: 743 1st Qu.:0.00000
## HB :14463 FRA :10415 Corporate : 5295 Median :0.00000
## none:11819 ESP : 8568 Direct :12605 Mean :0.03191
## DEU : 7287 Groups :19811 3rd Qu.:0.00000
## ITA : 3766 Offline TA/TO:24219 Max. :1.00000
## (Other):28635 Online TA :56476
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1114
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 0 :101310 No Deposit:104637 9 :31960
## D :25322 1 : 12701 Non Refund: 14587 none :16338
## E : 7806 2 : 3805 Refundable: 162 240 :13922
## F : 3751 3 : 927 1 : 7191
## G : 2553 4 : 376 14 : 3639
## C : 2375 5 : 118 7 : 3539
## (Other): 3526 (Other): 149 (Other):42797
## company days_in_waiting_list customer_type
## none :112589 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25120
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr total_of_special_requests ID
## Min. : -6.38 Min. :0.0000 Length:119386
## 1st Qu.: 69.29 1st Qu.:0.0000 Class :character
## Median : 94.59 Median :0.0000 Mode :character
## Mean : 101.83 Mean :0.5713
## 3rd Qu.: 126.00 3rd Qu.:1.0000
## Max. :5400.00 Max. :5.0000
##
# Identify numeric data
data.df.one %>%
select(where(is.numeric)) %>%
head(10)
## is_canceled lead_time arrival_date_day_of_month stays_in_weekend_nights
## 1 0 342 1 0
## 2 0 737 1 0
## 3 0 7 1 0
## 4 0 13 1 0
## 5 0 14 1 0
## 6 0 14 1 0
## 7 0 0 1 0
## 8 0 9 1 0
## 9 1 85 1 0
## 10 1 75 1 0
## stays_in_week_nights adults children babies is_repeated_guest
## 1 0 2 0 0 0
## 2 0 2 0 0 0
## 3 1 1 0 0 0
## 4 1 1 0 0 0
## 5 2 2 0 0 0
## 6 2 2 0 0 0
## 7 2 2 0 0 0
## 8 2 2 0 0 0
## 9 3 2 0 0 0
## 10 3 2 0 0 0
## previous_cancellations previous_bookings_not_canceled days_in_waiting_list
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 0
## adr total_of_special_requests
## 1 0.0 0
## 2 0.0 0
## 3 75.0 0
## 4 75.0 0
## 5 98.0 1
## 6 98.0 1
## 7 107.0 0
## 8 103.0 1
## 9 82.0 1
## 10 105.5 0
# Identify Categorical data
data.df.one %>%
select(where(is.factor)) %>%
head(10)
## hotel arrival_date_year arrival_date_month meal country
## 1 Resort Hotel 2015 July BB PRT
## 2 Resort Hotel 2015 July BB PRT
## 3 Resort Hotel 2015 July BB GBR
## 4 Resort Hotel 2015 July BB GBR
## 5 Resort Hotel 2015 July BB GBR
## 6 Resort Hotel 2015 July BB GBR
## 7 Resort Hotel 2015 July BB PRT
## 8 Resort Hotel 2015 July FB PRT
## 9 Resort Hotel 2015 July BB PRT
## 10 Resort Hotel 2015 July HB PRT
## market_segment reserved_room_type assigned_room_type booking_changes
## 1 Direct C C 3
## 2 Direct C C 4
## 3 Direct A C 0
## 4 Corporate A A 0
## 5 Online TA A A 0
## 6 Online TA A A 0
## 7 Direct C C 0
## 8 Direct C C 0
## 9 Online TA A A 0
## 10 Offline TA/TO D D 0
## deposit_type agent company customer_type
## 1 No Deposit none none Transient
## 2 No Deposit none none Transient
## 3 No Deposit none none Transient
## 4 No Deposit 304 none Transient
## 5 No Deposit 240 none Transient
## 6 No Deposit 240 none Transient
## 7 No Deposit none none Transient
## 8 No Deposit 303 none Transient
## 9 No Deposit 240 none Transient
## 10 No Deposit 15 none Transient
# Number of Datapoints and number of variables
dim_type <- c("datapoints","variables")
tibble(dim_type, dim(data.df.one))
## # A tibble: 2 x 2
## dim_type `dim(data.df.one)`
## <chr> <int>
## 1 datapoints 119386
## 2 variables 28
# Get quantiles for each numeric variable
#arrival_date_day_of_month
quantile(data.df.one$arrival_date_day_of_month)
## 0% 25% 50% 75% 100%
## 1 8 16 23 31
#stays_in_weekend_nights
quantile(data.df.one$stays_in_weekend_nights)
## 0% 25% 50% 75% 100%
## 0 0 1 2 19
#stays_in_week_nights
quantile(data.df.one$stays_in_week_nights)
## 0% 25% 50% 75% 100%
## 0 1 2 3 50
#adults
quantile(data.df.one$adults)
## 0% 25% 50% 75% 100%
## 0 2 2 2 55
#children
quantile(data.df.one$children, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 0 0 0 10
#babies
quantile(data.df.one$babies)
## 0% 25% 50% 75% 100%
## 0 0 0 0 10
#is_repeated_guest
quantile(data.df.one$is_repeated_guest)
## 0% 25% 50% 75% 100%
## 0 0 0 0 1
#previous_cancellations
quantile(data.df.one$previous_cancellations)
## 0% 25% 50% 75% 100%
## 0 0 0 0 26
#previous_bookings_not_canceled
quantile(data.df.one$previous_bookings_not_canceled)
## 0% 25% 50% 75% 100%
## 0 0 0 0 72
#days_in_waiting_list
quantile(data.df.one$days_in_waiting_list)
## 0% 25% 50% 75% 100%
## 0 0 0 0 391
#total_of_special_requests
quantile(data.df.one$total_of_special_requests)
## 0% 25% 50% 75% 100%
## 0 0 0 1 5
We will discover information in the data that is not self-evident by scrutinizing the impact of various individual variables on the cancellation rate. We separate the data in to quantiles to account for outliers and enable a closer examination of each variable. Our visualizations are especially helpful in depicting these relationships.
# Number of cancellations per hotel type
x1 <- data.df.one %>%
select(is_canceled, hotel) %>%
group_by(hotel) %>%
filter(is_canceled == 1) %>%
summarise(no_rows = length(hotel), .groups = 'drop') %>%
spread(hotel, no_rows)
colnames(x1) <- paste(colnames(x1), "canceled", sep = "_");
# Number of non-cancellations per hotel type
x2 <- data.df.one %>%
select(is_canceled, hotel) %>%
group_by(hotel) %>%
filter(is_canceled == 0) %>%
summarise(no_rows = length(hotel), .groups = 'drop') %>%
spread(hotel, no_rows)
colnames(x2) <- paste(colnames(x2), "not_canceled", sep = "_");
# Combine data sets
hotel_type_cancellations <- tibble(x1,x2);
# Function for calculating percent cancellation per type
percent_cancellation <- function(x,y) {
n <- (sum(x)+sum(y))
(sum(x)/n)
}
# Function for calculating percent cancellation per type
population_size_n <- function(x,y) {
(sum(x)+sum(y))
}
# Run percent_cancellation to determine percent cancellation of City Hotels
percent_cancellation(hotel_type_cancellations$`City Hotel_canceled`,hotel_type_cancellations$`City Hotel_not_canceled`)
## [1] 0.4172402
# Run population_size_n to determine the total population of each category
population_size_n(hotel_type_cancellations$`City Hotel_canceled`,hotel_type_cancellations$`City Hotel_not_canceled`)
## [1] 79326
# Run percent_cancellation to determine percent cancellation of Resort Hotels
percent_cancellation(hotel_type_cancellations$`Resort Hotel_canceled`,hotel_type_cancellations$`Resort Hotel_not_canceled`)
## [1] 0.2776335
# Run population_size_n to determine the total population of each category
population_size_n(hotel_type_cancellations$`Resort Hotel_canceled`,hotel_type_cancellations$`Resort Hotel_not_canceled`)
## [1] 40060
Approximately, 28% of bookings at resort hotels were cancelled, whereas 42% of bookings at city hotels were cancelled. Typically people stay at resort hotels for vacations and at city hotels for quick weekend getaways or business trips. The discrepancy in cancellation rates between the two variables may be due to the reluctance of people to cancel vacations that have been planned far in advance.
data.df.one %>%
select(where(is.factor)) %>%
head(10)
## hotel arrival_date_year arrival_date_month meal country
## 1 Resort Hotel 2015 July BB PRT
## 2 Resort Hotel 2015 July BB PRT
## 3 Resort Hotel 2015 July BB GBR
## 4 Resort Hotel 2015 July BB GBR
## 5 Resort Hotel 2015 July BB GBR
## 6 Resort Hotel 2015 July BB GBR
## 7 Resort Hotel 2015 July BB PRT
## 8 Resort Hotel 2015 July FB PRT
## 9 Resort Hotel 2015 July BB PRT
## 10 Resort Hotel 2015 July HB PRT
## market_segment reserved_room_type assigned_room_type booking_changes
## 1 Direct C C 3
## 2 Direct C C 4
## 3 Direct A C 0
## 4 Corporate A A 0
## 5 Online TA A A 0
## 6 Online TA A A 0
## 7 Direct C C 0
## 8 Direct C C 0
## 9 Online TA A A 0
## 10 Offline TA/TO D D 0
## deposit_type agent company customer_type
## 1 No Deposit none none Transient
## 2 No Deposit none none Transient
## 3 No Deposit none none Transient
## 4 No Deposit 304 none Transient
## 5 No Deposit 240 none Transient
## 6 No Deposit 240 none Transient
## 7 No Deposit none none Transient
## 8 No Deposit 303 none Transient
## 9 No Deposit 240 none Transient
## 10 No Deposit 15 none Transient
monthly_cancellations <- data.df.one %>%
select(is_canceled, arrival_date_month) %>%
filter(is_canceled == 1) %>%
group_by(arrival_date_month) %>%
summarize(cancellations = n(), .groups = 'drop')
monthly_bookings <- data.df.one %>%
select(is_canceled, arrival_date_month) %>%
group_by(arrival_date_month) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
monthly_cancellations.df <- full_join(monthly_bookings, monthly_cancellations, by = "arrival_date_month") %>%
mutate(cancel_rate = cancellations/bookings) %>%
mutate(month_number = ifelse(arrival_date_month == "January",1,
ifelse(arrival_date_month == "February",2,
ifelse(arrival_date_month == "March",3,
ifelse(arrival_date_month == "April",4,
ifelse(arrival_date_month == "May",5,
ifelse(arrival_date_month == "June",6,
ifelse(arrival_date_month == "July",7,
ifelse(arrival_date_month == "August",8,
ifelse(arrival_date_month == "September",9,
ifelse(arrival_date_month == "October", 10,
ifelse(arrival_date_month == "November",11,
12)))))))))))) %>%
arrange((month_number))
monthly_cancellations.chart <- ggplot(monthly_cancellations.df, aes(y=cancel_rate, reorder(x=arrival_date_month, month_number), fill=arrival_date_month)) +
geom_bar( stat="identity") +
scale_x_discrete(labels=c("Jan", "Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
labs(y = "Cancel Rate", x = "Arrival Month", title = "Cancel Rate by Arrival Month")
monthly_cancellations.df;
## # A tibble: 12 x 5
## arrival_date_month bookings cancellations cancel_rate month_number
## <fct> <int> <int> <dbl> <dbl>
## 1 January 5929 1807 0.305 1
## 2 February 8068 2696 0.334 2
## 3 March 9794 3149 0.322 3
## 4 April 11089 4524 0.408 4
## 5 May 11791 4677 0.397 5
## 6 June 10939 4535 0.415 6
## 7 July 12661 4742 0.375 7
## 8 August 13873 5235 0.377 8
## 9 September 10508 4116 0.392 9
## 10 October 11160 4246 0.380 10
## 11 November 6794 2122 0.312 11
## 12 December 6780 2371 0.350 12
monthly_cancellations.chart;
Cancellations fluctuate based on the season. Cancellations are highest in the spring and early summer and increase again in late summer.
meal_cancellations <- data.df.one %>%
select(is_canceled, meal) %>%
filter(is_canceled == 1) %>%
group_by(meal) %>%
summarize(cancellations = n(), .groups = 'drop')
meal_bookings <- data.df.one %>%
select(is_canceled, meal) %>%
group_by(meal) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
meal_cancellations.df <- full_join(meal_bookings, meal_cancellations, by = "meal") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
meal_cancellations.chart <- ggplot(meal_cancellations.df, aes( y=cancel_rate, reorder(x=meal, -cancel_rate), fill =meal)) +
geom_bar( stat="identity") +
scale_x_discrete(labels=c("Full Board", "Bed & Breakfast", "None", "Half Board")) +
scale_fill_manual(values = brewer.pal(n=4,name="Set3")) +
labs(y = "Cancel Rate", x = "Meal Type ", title = "Cancel Rate by Meal Type")
meal_cancellations.chart;
meal_cancellations.df;
## # A tibble: 4 x 4
## meal bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 FB 798 478 0.599
## 2 BB 92306 34506 0.374
## 3 none 11819 4252 0.360
## 4 HB 14463 4984 0.345
Cancellation rates are significantly higher (59.9%) for hotels serving full board breakfasts. Bookings are significantly smaller for hotels with full board breakfasts so the significance of this finding is questionable.
arrival_date_year_cancellations <- data.df.one %>%
select(is_canceled, arrival_date_year) %>%
filter(is_canceled == 1) %>%
group_by(arrival_date_year) %>%
summarize(cancellations = n(), .groups = 'drop')
arrival_date_year_bookings <- data.df.one %>%
select(is_canceled, arrival_date_year) %>%
group_by(arrival_date_year) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
arrival_date_year_cancellations.df <- full_join(arrival_date_year_bookings, arrival_date_year_cancellations, by = "arrival_date_year") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(arrival_date_year)) %>%
mutate(arrival_date_year_int = as.integer(arrival_date_year))
arrival_date_year_cancellations.chart <- ggplot(arrival_date_year_cancellations.df, aes( y=cancel_rate, reorder(x=arrival_date_year, arrival_date_year_int), fill=arrival_date_year)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=3,name="Set3")) +
labs(y = "Cancel Rate", x = "Arrival Year", title = "Cancel Rate by Arrival Year")
arrival_date_year_cancellations.chart;
arrival_date_year_cancellations.df;
## # A tibble: 3 x 5
## arrival_date_year bookings cancellations cancel_rate arrival_date_year_int
## <fct> <int> <int> <dbl> <int>
## 1 2017 40687 15745 0.387 3
## 2 2016 56707 20337 0.359 2
## 3 2015 21992 8138 0.370 1
Bookings trended upwards in 2017 indicating people may be booking more due to the availability of other options such as AirBnB’s. Further research should examine trends in cancellation rates in the last 5 years.
country_cancellations <- data.df.one %>%
select(is_canceled, country) %>%
filter(is_canceled == 1) %>%
group_by(country) %>%
summarize(cancellations = n(), .groups = 'drop')
country_bookings <- data.df.one %>%
select(is_canceled, country) %>%
group_by(country) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
country_cancellations.df <- full_join(country_bookings, country_cancellations, by = "country") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(bookings))
country_cancellations.df
## # A tibble: 178 x 4
## country bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 PRT 48586 27515 0.566
## 2 GBR 12129 2453 0.202
## 3 FRA 10415 1934 0.186
## 4 ESP 8568 2177 0.254
## 5 DEU 7287 1218 0.167
## 6 ITA 3766 1333 0.354
## 7 IRL 3375 832 0.247
## 8 BEL 2342 474 0.202
## 9 BRA 2224 830 0.373
## 10 NLD 2104 387 0.184
## # ... with 168 more rows
# Include only countries in 90th percentile of total bookings (>644)
quantile(country_cancellations.df$bookings,.90)
## 90%
## 643.1
country_cancellations_90.df <- country_cancellations.df %>%
filter(bookings >= 644)
country_cancellations_90.df;
## # A tibble: 18 x 4
## country bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 PRT 48586 27515 0.566
## 2 GBR 12129 2453 0.202
## 3 FRA 10415 1934 0.186
## 4 ESP 8568 2177 0.254
## 5 DEU 7287 1218 0.167
## 6 ITA 3766 1333 0.354
## 7 IRL 3375 832 0.247
## 8 BEL 2342 474 0.202
## 9 BRA 2224 830 0.373
## 10 NLD 2104 387 0.184
## 11 USA 2097 501 0.239
## 12 CHE 1730 428 0.247
## 13 CN 1279 254 0.199
## 14 AUT 1263 230 0.182
## 15 SWE 1024 227 0.222
## 16 CHN 999 462 0.462
## 17 POL 919 215 0.234
## 18 ISR 669 169 0.253
colourCount = length(unique(country_cancellations_90.df$country))
getPalette = colorRampPalette(brewer.pal(n=18, name="Set3"))
country_cancellations_90.chart <- ggplot(country_cancellations_90.df, aes( y=cancel_rate, reorder(x=country, -cancel_rate), fill=country)) +
geom_bar(stat="identity") +
labs(y = "Cancel Rate", x = "Country", title = "Cancel Rate by Country")
country_cancellations_90.chart;
# Makes sense that China has a high cancellation rate becasue, in oreder to ger a visa, you have to have already decided where you are going to stay. Further, it is very common to book a hotel and cancel upon arrival.
Geographical and cultural norms may account for differences in cancellation rates across the globe. Portugal and China have cancellation rates of 57% and 46%, respectively which are the highest cancellation rates among the 18 countries observed. Both of these countries require visitors have a hotel reservations to get a visa meaning after visitors obtain a Visa they cancel their hotel bookings and make different staying arrangements. Differences in the number of bookings for each country and cultural norms and practices may also attribute for different cancellation rates throughout the world.
market_segment_cancellations <- data.df.one %>%
select(is_canceled, market_segment) %>%
filter(is_canceled == 1) %>%
group_by(market_segment) %>%
summarize(cancellations = n(), .groups = 'drop')
market_segment_bookings <- data.df.one %>%
select(is_canceled, market_segment) %>%
group_by(market_segment) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
market_segment_cancellations.df <- full_join(market_segment_bookings, market_segment_cancellations, by = "market_segment") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
market_segment_cancellations.chart <- ggplot(na.omit(market_segment_cancellations.df),aes( y=cancel_rate, reorder(x=market_segment, -cancel_rate), fill=market_segment)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=8,name="Set3")) +
labs(y = "Cancel Rate", x = "Market Segment", title = "Cancel Rate by Market Segment")
market_segment_cancellations.chart;
market_segment_cancellations.df;
## # A tibble: 7 x 4
## market_segment bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 Groups 19811 12097 0.611
## 2 Online TA 56476 20738 0.367
## 3 Offline TA/TO 24219 8311 0.343
## 4 Aviation 237 52 0.219
## 5 Corporate 5295 992 0.187
## 6 Direct 12605 1933 0.153
## 7 Complementary 743 97 0.131
Here we see that complementary trips are very unlikely to be cancelled. Additionally, we see that groups have a high cancellations rate. These findings are backed up by later findings that show, the more people in a reservation, the more likely to be cancelled. Bookings made online also have a relatively high cancel rate. This is likely due to the ease of making these plans and ease of canelling them.
reserved_room_type_cancellations <- data.df.one %>%
select(is_canceled, reserved_room_type) %>%
filter(is_canceled == 1) %>%
group_by(reserved_room_type) %>%
summarize(cancellations = n(), .groups = 'drop')
reserved_room_type_bookings <- data.df.one %>%
select(is_canceled, reserved_room_type) %>%
group_by(reserved_room_type) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
reserved_room_type_cancellations.df <- full_join(reserved_room_type_bookings, reserved_room_type_cancellations, by = "reserved_room_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
reserved_room_type_cancellations.chart <- ggplot(reserved_room_type_cancellations.df, aes( y=cancel_rate, reorder(x=reserved_room_type, -cancel_rate), fill=reserved_room_type)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=10,name="Set3")) +
labs(y = "Cancel Rate", x = "Reserved Room Type", title = "Cancel Rate by Reserved Room Type")
reserved_room_type_cancellations.df;
## # A tibble: 10 x 4
## reserved_room_type bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 P 12 12 1
## 2 H 601 245 0.408
## 3 A 85994 33630 0.391
## 4 G 2094 763 0.364
## 5 L 6 2 0.333
## 6 C 932 308 0.330
## 7 B 1114 364 0.327
## 8 D 19201 6102 0.318
## 9 F 2897 880 0.304
## 10 E 6535 1914 0.293
reserved_room_type_cancellations.chart;
assigned_room_type_cancellations <- data.df.one %>%
select(is_canceled, assigned_room_type) %>%
filter(is_canceled == 1) %>%
group_by(assigned_room_type) %>%
summarize(cancellations = n(), .groups = 'drop')
assigned_room_type_bookings <- data.df.one %>%
select(is_canceled, assigned_room_type) %>%
group_by(assigned_room_type) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
assigned_room_type_cancellations.df <- full_join(assigned_room_type_bookings, assigned_room_type_cancellations, by = "assigned_room_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
assigned_room_type_cancellations.chart <- ggplot(assigned_room_type_cancellations.df, aes( y=cancel_rate, reorder(x=assigned_room_type, -cancel_rate), fill=assigned_room_type)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
labs(y = "Cancel Rate", x = "Assigned Room Type", title = "Cancel Rate by Assigned Room Type")
assigned_room_type_cancellations.df;
## # A tibble: 12 x 4
## assigned_room_type bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 L 1 1 1
## 2 P 12 12 1
## 3 A 74053 32948 0.445
## 4 H 712 251 0.353
## 5 G 2553 780 0.306
## 6 E 7806 1968 0.252
## 7 D 25322 6362 0.251
## 8 F 3751 927 0.247
## 9 B 2159 508 0.235
## 10 C 2375 446 0.188
## 11 K 279 12 0.0430
## 12 I 363 5 0.0138
assigned_room_type_cancellations.chart;
#Add comparison to see if assigned room is the same as the reserved room
#Comparison to see if assigned room is the same as the reserved room
reserved_room_change.df <- data.df.one %>%
select(is_canceled, reserved_room_type, assigned_room_type) %>%
mutate(room_changed = ifelse(str_detect(as.character(assigned_room_type), as.character(reserved_room_type)) == "TRUE",0,1)) %>%
mutate(changed_room_type = as.factor(paste(reserved_room_type,assigned_room_type,sep="_to_")))
changed_room_type_cancellations <- reserved_room_change.df %>%
select(is_canceled, room_changed, changed_room_type) %>%
filter(is_canceled == 1, room_changed == 1) %>%
group_by(changed_room_type) %>%
summarize(cancellations = n(), .groups = 'drop')
changed_room_type_bookings <- reserved_room_change.df %>%
select(is_canceled, room_changed, changed_room_type) %>%
group_by(changed_room_type) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
changed_room_type.df <- left_join(changed_room_type_cancellations, changed_room_type_bookings, by = "changed_room_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
# Include only countries in 75th percentile of total bookings
quantile(changed_room_type.df$bookings,.75)
## 75%
## 287.75
changed_room_type_75.df <- changed_room_type.df %>%
filter(bookings >= quantile(changed_room_type.df$bookings,.75))
changed_room_type_75.df;
## # A tibble: 8 x 4
## changed_room_type cancellations bookings cancel_rate
## <fct> <int> <int> <dbl>
## 1 A_to_B 146 1123 0.130
## 2 A_to_C 139 1447 0.0961
## 3 A_to_F 27 417 0.0647
## 4 D_to_E 35 686 0.0510
## 5 E_to_F 19 404 0.0470
## 6 A_to_D 319 7548 0.0423
## 7 A_to_E 47 1156 0.0407
## 8 D_to_A 10 312 0.0321
changed_room_type_75.chart <- ggplot(changed_room_type_75.df, aes( y=cancel_rate, reorder(x=changed_room_type, -cancel_rate), fill=changed_room_type)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
labs(y = "Cancel Rate", x = "Changed Room Type", title = "Cancel Rate by Changed Room Type - 75th Percentile")
changed_room_type_75.chart
# check cancel rate when room is not changed. Used to support assigned room cancellations to determine whether assigned room was the room that was reserved when they canceled.
unchanged_room_type_cancellations <- reserved_room_change.df %>%
select(is_canceled, room_changed, changed_room_type) %>%
filter(is_canceled == 1, room_changed == 0) %>%
group_by(changed_room_type) %>%
summarize(cancellations = n(), .groups = 'drop')
unchanged_room_type_bookings <- reserved_room_change.df %>%
select(is_canceled, room_changed, changed_room_type) %>%
group_by(changed_room_type) %>%
summarize(bookings = n(), .groups = 'drop')
unchanged_room_type.df <- left_join(unchanged_room_type_cancellations, unchanged_room_type_bookings, by = "changed_room_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
quantile(changed_room_type.df$bookings,.75)
## 75%
## 287.75
unchanged_room_type_75.df <- unchanged_room_type.df %>%
filter(bookings >= quantile(unchanged_room_type.df$bookings,.75))
unchanged_room_type_75.chart <- ggplot(unchanged_room_type.df, aes( y=cancel_rate, reorder(x=changed_room_type, -cancel_rate), fill=changed_room_type)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
labs(y = "Cancel Rate", x = "Unchanged Room Type", title = "Cancel Rate by Unchanged Room Type - 75th Percentile")
unchanged_room_type_75.chart
This shows us that if a customer is moved from an A type room to another type of room, they are more likely to cancel.
# lead_time histogram and analysis
lead_time <- data.df.one %>%
select(lead_time, is_canceled) %>%
mutate(lead_time_bins = ifelse(lead_time <= 30, "1 Month",
ifelse(lead_time <= 60, "2 Months",
ifelse(lead_time <= 90, "3 Months",
ifelse(lead_time <= 180, "4 - 6 Months",
ifelse(lead_time <= 365, "6 - 12 Months",
"More than a Year"))))))
lead_time_cancellations <- lead_time %>%
select(is_canceled, lead_time_bins) %>%
filter(is_canceled == 1) %>%
group_by(lead_time_bins) %>%
summarize(cancellations = n(), .groups = 'drop')
lead_time_bookings <- lead_time %>%
select(is_canceled, lead_time_bins) %>%
group_by(lead_time_bins) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
lead_time.df <- full_join(lead_time_bookings, lead_time_cancellations, by = "lead_time_bins") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
lead_time.chart <- ggplot(lead_time.df, aes( y=cancel_rate, reorder(x=lead_time_bins, -cancel_rate), fill=lead_time_bins)) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=6, name="Set3")) +
labs(y = "Cancel Rate", x = "Lead Time", title = "Cancel Rate by Lead Time")
lead_time.df;
## # A tibble: 6 x 4
## lead_time_bins bookings cancellations cancel_rate
## <chr> <int> <int> <dbl>
## 1 More than a Year 3148 2130 0.677
## 2 6 - 12 Months 21544 11947 0.555
## 3 4 - 6 Months 26439 11821 0.447
## 4 3 Months 12583 4967 0.395
## 5 2 Months 16970 6174 0.364
## 6 1 Month 38702 7181 0.186
lead_time.chart;
The farther in advance a booking is made, the more likely the booking is to be cancelled. Bookings made over 4 months ahead of time have a cancellation rate much higher than average. Bookings made 1 month in advance have a significantly lower than average (18.6%) cancellation rate.
# arrival_data_day_of_month_histogram
arrival_date_day_of_month_cancellations <- data.df.one %>%
select(arrival_date_day_of_month, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(arrival_date_day_of_month) %>%
summarize(cancellations = n(), .groups = 'drop')
arrival_date_day_of_month_bookings <- data.df.one %>%
select(arrival_date_day_of_month, is_canceled) %>%
group_by(arrival_date_day_of_month) %>%
summarize(bookings = n(), .groups = 'drop')
arrival_date_day_of_month.df <- full_join(arrival_date_day_of_month_bookings, arrival_date_day_of_month_cancellations, by = "arrival_date_day_of_month") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
colourCount = length(unique(arrival_date_day_of_month.df$arrival_date_day_of_month))
getPalette = colorRampPalette(brewer.pal(n=31, name="Set3"))
arrival_date_day_of_month.chart <- ggplot(arrival_date_day_of_month.df, aes( y=cancel_rate, reorder(x=arrival_date_day_of_month, arrival_date_day_of_month), fill=as.factor(arrival_date_day_of_month))) +
geom_bar( stat="identity") +
labs(y = "Cancel Rate", x = "Date of Month", title = "Cancel Rate by Arrival Date of Month")
arrival_date_day_of_month.chart;
arrival_date_day_of_month.df;
## # A tibble: 31 x 4
## arrival_date_day_of_month bookings cancellations cancel_rate
## <int> <int> <int> <dbl>
## 1 8 3921 1647 0.420
## 2 17 4406 1837 0.417
## 3 3 3854 1544 0.401
## 4 1 3626 1444 0.398
## 5 15 4196 1670 0.398
## 6 7 3665 1441 0.393
## 7 26 4147 1626 0.392
## 8 5 4315 1677 0.389
## 9 28 3946 1511 0.383
## 10 14 3819 1461 0.383
## # ... with 21 more rows
There does not appear to be a relationship between the arrival date of month and the cancellation rate.
#Total length of stay histogram
length_of_stay <- data.df.one %>%
select(stays_in_week_nights,stays_in_weekend_nights,is_canceled) %>%
mutate(total_days = stays_in_week_nights+stays_in_weekend_nights)
length_of_stay_cancellations <- length_of_stay %>%
select(total_days, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(total_days) %>%
summarize(cancellations = n(), .groups = 'drop')
length_of_stay_bookings <- length_of_stay %>%
select(total_days, is_canceled) %>%
group_by(total_days) %>%
summarize(bookings = n(), .groups = 'drop')
length_of_stay.df <- full_join(length_of_stay_bookings, length_of_stay_cancellations, by = "total_days") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
# only chart where bookings are 50th percentile or higher. 50th percentile is about a month, anything longer, we did not consider a typical vacation.
quantile(length_of_stay.df$bookings,.50)
## 50%
## 20
colourCount = length(unique(length_of_stay.df$total_days))
getPalette = colorRampPalette(brewer.pal(n=23, name="Set3"))
length_of_stay.chart <-length_of_stay.df %>%
filter(bookings >= quantile(length_of_stay.df$bookings,.50),) %>%
ggplot(aes( y=cancel_rate, reorder(x=total_days, total_days), fill=as.factor(total_days))) +
geom_bar( stat="identity") +
labs(y = "Cancel Rate", x = "Length of Stay", title = "Cancel Rate by Length of Stay")
length_of_stay.chart;
tibble(length_of_stay.df);
## # A tibble: 45 x 4
## total_days bookings cancellations cancel_rate
## <int> <int> <int> <dbl>
## 1 26 6 6 1
## 2 33 3 3 1
## 3 20 14 13 0.929
## 4 30 13 11 0.846
## 5 24 6 5 0.833
## 6 18 35 29 0.829
## 7 27 5 4 0.8
## 8 29 14 11 0.786
## 9 23 8 6 0.75
## 10 19 22 16 0.727
## # ... with 35 more rows
There is an upward trend indicating guests are more likely to cancel when they have longer stays.
# Repeated guest histogram
is_repeated_guest_cancellations <- data.df.one %>%
select(is_repeated_guest, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(is_repeated_guest) %>%
summarize(cancellations = n(), .groups = 'drop')
is_repeated_guest_bookings <- data.df.one %>%
select(is_repeated_guest, is_canceled) %>%
group_by(is_repeated_guest) %>%
summarize(bookings = n(), .groups = 'drop')
is_repeated_guest.df <- full_join(is_repeated_guest_bookings, is_repeated_guest_cancellations, by = "is_repeated_guest") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
is_repeated_guest.chart <- ggplot(is_repeated_guest.df, aes( y=cancel_rate, reorder(x=is_repeated_guest, -cancel_rate), fill=as.factor(is_repeated_guest))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Repeated Guest (0 = no, 1 = yes)", title = "Cancel Rate by Repeated Guests")
is_repeated_guest.chart;
tibble(is_repeated_guest.df);
## # A tibble: 2 x 4
## is_repeated_guest bookings cancellations cancel_rate
## <int> <int> <int> <dbl>
## 1 0 115576 43668 0.378
## 2 1 3810 552 0.145
Repeated guests have a cancellation rate of 14.5%, whereas first time guests have a cancellation rate of 37.8%. Repeated guests’ cancellation rate of 14.5% is significantly less than the average cancellation rate of 37.04% indicating repeated guests are more dependable.
# previous cancellations histogram
is_previous_cancellations <- data.df.one %>%
mutate(previous_cancellations = ifelse(previous_cancellations >= 1,1,0))
previous_cancellations_cancellations <- is_previous_cancellations %>%
select(previous_cancellations, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(previous_cancellations) %>%
summarize(cancellations = n(), .groups = 'drop')
previous_cancellations_bookings <- is_previous_cancellations %>%
select(previous_cancellations, is_canceled) %>%
group_by(previous_cancellations) %>%
summarize(bookings = n(), .groups = 'drop')
previous_cancellations.df <- full_join(previous_cancellations_bookings, previous_cancellations_cancellations, by = "previous_cancellations") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
previous_cancellations.chart <- ggplot(previous_cancellations.df, aes( y=cancel_rate, reorder(x=previous_cancellations, -cancel_rate), fill=as.factor(previous_cancellations))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Previous Cancellations (0 = no, 1 = yes)", title = "Cancel Rate by Previous Cancellations")
previous_cancellations.chart;
tibble(previous_cancellations.df)
## # A tibble: 2 x 4
## previous_cancellations bookings cancellations cancel_rate
## <dbl> <int> <int> <dbl>
## 1 1 6484 5942 0.916
## 2 0 112902 38278 0.339
Guests with previous cancellations have an extremely high cancellation rate of 91.6%. Guests who have previously cancelled should assigned a much higher level of cancellation risk than those who have not previously cancelled.
#previous bookings not canceled histogram
is_previous_bookings_not_canceled <- data.df.one %>%
mutate(previous_bookings_not_canceled = ifelse(previous_bookings_not_canceled >= 1,1,0))
previous_bookings_not_canceled_cancellations <- is_previous_bookings_not_canceled %>%
select(previous_bookings_not_canceled, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(previous_bookings_not_canceled) %>%
summarize(cancellations = n(), .groups = 'drop')
previous_bookings_not_canceled_bookings <- is_previous_bookings_not_canceled %>%
select(previous_bookings_not_canceled, is_canceled) %>%
group_by(previous_bookings_not_canceled) %>%
summarize(bookings = n(), .groups = 'drop')
previous_bookings_not_canceled.df <- full_join(previous_bookings_not_canceled_cancellations, previous_bookings_not_canceled_bookings, by = "previous_bookings_not_canceled") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
previous_bookings_not_canceled.chart <- ggplot(previous_bookings_not_canceled.df, aes( y=cancel_rate, reorder(x=previous_bookings_not_canceled, -cancel_rate), fill=as.factor(previous_bookings_not_canceled))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Previous Bookings not Cancelled (0 = no, 1 = yes)", title = "Cancel Rate by Previous Bookings Not Cancelled")
previous_bookings_not_canceled.chart;
tibble(previous_bookings_not_canceled.df);
## # A tibble: 2 x 4
## previous_bookings_not_canceled cancellations bookings cancel_rate
## <dbl> <int> <int> <dbl>
## 1 0 44020 115766 0.380
## 2 1 200 3620 0.0552
Guests are less likely to cancel if they have previously bookings that were not cancelled. Observing guests’ cancellation history can be helpful in predicting the odds they will cancel.
#Whether customer waited on list for reservation? histogram
is_days_in_waiting_list <- data.df.one %>%
mutate(days_in_waiting_list = ifelse(days_in_waiting_list >= 1,1,0))
days_in_waiting_list_cancellations <- is_days_in_waiting_list %>%
select(days_in_waiting_list, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(days_in_waiting_list) %>%
summarize(cancellations = n(), .groups = 'drop')
days_in_waiting_list_bookings <- is_days_in_waiting_list %>%
select(days_in_waiting_list, is_canceled) %>%
group_by(days_in_waiting_list) %>%
summarize(bookings = n(), .groups = 'drop')
days_in_waiting_list.df <- full_join(days_in_waiting_list_cancellations, days_in_waiting_list_bookings, by = "days_in_waiting_list") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
days_in_waiting_list.chart <- ggplot(days_in_waiting_list.df, aes( y=cancel_rate, reorder(x=days_in_waiting_list, -cancel_rate), fill=as.factor(days_in_waiting_list))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "On Waiting List (0 = no, 1 = yes)", title = "Cancel Rate by Being on Waiting List")
tibble(days_in_waiting_list.df);
## # A tibble: 2 x 4
## days_in_waiting_list cancellations bookings cancel_rate
## <dbl> <int> <int> <dbl>
## 1 1 2359 3698 0.638
## 2 0 41861 115688 0.362
days_in_waiting_list.chart
Approximately 63.8% of bookings are cancelled when customers are placed on a waiting list, hence customers are almost twice as likely to cancel when they are on a waiting list. Customers who are placed on a waiting list most likely search for other available options, and cancel their booking when they find a booking elsewhere that does not require them to be placed on a waiting list.
#Whether customer has special requests histogram
is_total_of_special_requests <- data.df.one %>%
mutate(total_of_special_requests = ifelse(total_of_special_requests >= 1,1,0))
total_of_special_requests_cancellations <- is_total_of_special_requests %>%
select(total_of_special_requests, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(total_of_special_requests) %>%
summarize(cancellations = n(), .groups = 'drop')
total_of_special_requests_bookings <- is_total_of_special_requests %>%
select(total_of_special_requests, is_canceled) %>%
group_by(total_of_special_requests) %>%
summarize(bookings = n(), .groups = 'drop')
total_of_special_requests.df <- full_join(total_of_special_requests_cancellations, total_of_special_requests_bookings, by = "total_of_special_requests") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
total_of_special_requests.chart <- ggplot(total_of_special_requests.df, aes( y=cancel_rate, reorder(x=total_of_special_requests, -cancel_rate), fill=as.factor(total_of_special_requests))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Special Request (0 = no, 1 = yes)", title = "Cancel Rate by Special Requests")
total_of_special_requests.df;
## # A tibble: 2 x 4
## total_of_special_requests cancellations bookings cancel_rate
## <dbl> <int> <int> <dbl>
## 1 0 33556 70318 0.477
## 2 1 10664 49068 0.217
total_of_special_requests.chart;
Customers with special requests have a cancellation rate of 21.7% indicating they are less likely to cancel than those without special requests. It is more difficult for customers with special requests to search for alternative options because other hotels may not be able to cater to their special requests.
# total people histogram
number_of_people <- data.df.one %>%
select(children,babies,adults,is_canceled) %>%
mutate(total_people = children+babies+adults)
number_of_people_cancellations <- number_of_people %>%
select(total_people, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(total_people) %>%
summarize(cancellations = n(), .groups = 'drop')
number_of_people_bookings <- number_of_people %>%
select(total_people, is_canceled) %>%
group_by(total_people) %>%
summarize(bookings = n(), .groups = 'drop')
number_of_people.df <- full_join(number_of_people_bookings, number_of_people_cancellations, by = "total_people") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
# only chart where bookings are 50th percentile or higher. 50th percentile is about a month, anything longer, we did not consider a typical vacation.
quantile(number_of_people.df$bookings,.50)
## 50%
## 2
number_of_people.chart <-number_of_people.df %>%
filter(bookings >= quantile(number_of_people.df$bookings,.50),) %>%
ggplot(aes( y=cancel_rate, reorder(x=total_people, total_people), fill=as.factor(total_people))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=11, name="Set3")) +
labs(y = "Cancel Rate", x = "Total Number of People", title = "Cancel Rate by Total Number of People")
number_of_people.chart;
number_of_people.df
## # A tibble: 15 x 4
## total_people bookings cancellations cancel_rate
## <int> <int> <int> <dbl>
## 1 6 1 1 1
## 2 20 2 2 1
## 3 26 5 5 1
## 4 27 2 2 1
## 5 40 1 1 1
## 6 50 1 1 1
## 7 55 1 1 1
## 8 10 2 1 0.5
## 9 12 2 1 0.5
## 10 4 3929 1646 0.419
## 11 2 82048 32569 0.397
## 12 3 10494 3376 0.322
## 13 1 22581 6555 0.290
## 14 5 137 34 0.248
## 15 0 180 25 0.139
A positive relationship is observed between the number of people in a booking and the cancellation rate. The more people in a reservation, the more likely the reservation is to be cancelled. The extremely small number of observations for bookings with more than 4 people challenges the signifance of this relationship.
#Have Children Histogram
is_children <- data.df.one %>%
mutate(children_babies = ifelse(babies >= 1,1,
ifelse(children >= 1,1,0)))
is_children_cancellations <- is_children %>%
select(children_babies, is_canceled) %>%
filter(is_canceled == 1) %>%
group_by(children_babies) %>%
summarize(cancellations = n(), .groups = 'drop')
is_children_bookings <- is_children %>%
select(children_babies, is_canceled) %>%
group_by(children_babies) %>%
summarize(bookings = n(), .groups = 'drop')
is_children.df <- full_join(is_children_cancellations, is_children_bookings, by = "children_babies") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
is_children.chart <- ggplot(is_children.df, aes( y=cancel_rate, reorder(x=children_babies, -cancel_rate), fill=as.factor(children_babies))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Children (0 = no, 1 = yes)", title = "Cancel Rate by Children")
is_children.df;
## # A tibble: 2 x 4
## children_babies cancellations bookings cancel_rate
## <dbl> <int> <int> <dbl>
## 1 0 40961 110054 0.372
## 2 1 3259 9332 0.349
is_children.chart;
Whether or not a guest has children does not appear to affect the cancellation rate. This finding is surprising as we expected guests with children to be more likely to cancel due to the challenges of traveling with children.
adr <- data.df.one %>%
select(adr, is_canceled) %>%
mutate(adr_bins = ifelse(adr <= 30, "under 30",
ifelse(adr <= 60, "30 - 90",
ifelse(adr <= 90, "60 - 90",
ifelse(adr <= 120, "90 - 120",
ifelse(adr <= 150, "120 - 150",
ifelse(adr <= 180,"150 - 180",
ifelse(adr <= 210,"180 - 210",
ifelse(adr <= 240,"210 - 240",
"240 and higher")))))))))
adr_cancellations <- adr %>%
select(is_canceled, adr_bins) %>%
filter(is_canceled == 1) %>%
group_by(adr_bins) %>%
summarize(cancellations = n(), .groups = 'drop')
adr_bookings <- adr %>%
select(is_canceled, adr_bins) %>%
group_by(adr_bins) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data
adr.df <- full_join(adr_bookings, adr_cancellations, by = "adr_bins") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
adr.chart <- ggplot(adr.df, aes( y=cancel_rate, reorder(x=adr_bins, -cancel_rate), fill=as.factor(adr_bins))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=9, name="Set3")) +
labs(y = "Cancel Rate", x = "Average Daily Rate", title = "Cancel Rate by Average Daily Rate")
adr.df;
## # A tibble: 9 x 4
## adr_bins bookings cancellations cancel_rate
## <chr> <int> <int> <dbl>
## 1 60 - 90 38077 15565 0.409
## 2 210 - 240 2189 863 0.394
## 3 120 - 150 16703 6430 0.385
## 4 90 - 120 30112 11587 0.385
## 5 150 - 180 8622 3302 0.383
## 6 240 and higher 1702 643 0.378
## 7 180 - 210 3995 1479 0.370
## 8 30 - 90 14459 3767 0.261
## 9 under 30 3527 584 0.166
adr.chart
The cheaper the hotel the less likely the customer is to cancel. The cancellation rate for bookings with an ADR of $30-$90 is 26%, and the cancellation rate for bookings with an ADR under $30 is 16.6%. Both of these percentages are significantly lower than the average cancellation rate of 37.04%. This is logical because lower income families are more likely to value vacation more as, it does not happen often.
levels(data.df.one$deposit_type)
## [1] "No Deposit" "Non Refund" "Refundable"
deposit_type_cancellations <- data.df.one %>%
select(is_canceled, deposit_type) %>%
filter(is_canceled == 1) %>%
group_by(deposit_type) %>%
summarize(cancellations = n(), .groups = 'drop')
deposit_type_bookings <- data.df.one %>%
select(is_canceled, deposit_type) %>%
group_by(deposit_type) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
deposit_type.df <- full_join(deposit_type_bookings, deposit_type_cancellations, by = "deposit_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
deposit_type.chart <- ggplot(deposit_type.df, aes( y=cancel_rate, reorder(x=deposit_type, -cancel_rate), fill=as.factor(deposit_type))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=3, name="Set3")) +
labs(y = "Cancel Rate", x = "Deposit Type", title = "Cancel Rate by Deposit Type")
deposit_type.df;
## # A tibble: 3 x 4
## deposit_type bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 Non Refund 14587 14494 0.994
## 2 No Deposit 104637 29690 0.284
## 3 Refundable 162 36 0.222
deposit_type.chart;
Customers are much more likely to cancel if there is a non-refunadable deposit. This effect is most likely due to reverse causality as hotels with extremely high cancellation rates are more likely to have non-refundable deposits to protect against losses.
#whether customer used agent or company
agent_company <- data.df.one %>%
select(agent,company, is_canceled) %>%
mutate(agent = ifelse(agent=="none",0,agent)) %>%
mutate(company = ifelse(company=="none",0,company)) %>%
mutate(agent_company_cancat = (ifelse(agent > 0 | company > 0,1,0)))
agent_company_cancellations <- agent_company %>%
select(is_canceled, agent_company_cancat) %>%
filter(is_canceled == 1) %>%
group_by(agent_company_cancat) %>%
summarize(cancellations = n(), .groups = 'drop')
agent_company_bookings <- agent_company %>%
select(is_canceled, agent_company_cancat) %>%
group_by(agent_company_cancat) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data
agent_company.df <- full_join(agent_company_bookings, agent_company_cancellations, by = "agent_company_cancat") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
agent_company.chart <- ggplot(agent_company.df, aes( y=cancel_rate, reorder(x=agent_company_cancat, -cancel_rate), fill=as.factor(agent_company_cancat))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
labs(y = "Cancel Rate", x = "Booked Through Agent/Company (1 = yes, 0 = no)", title = "Cancel Rate by How Booking was Made")
agent_company.df;
## # A tibble: 2 x 4
## agent_company_cancat bookings cancellations cancel_rate
## <dbl> <int> <int> <dbl>
## 1 1 109628 41366 0.377
## 2 0 9758 2854 0.292
agent_company.chart
Customers who make bookings through travel agents or companies have a cancellation rate of 29.2% which is about 8% less than the average cancellation rate. Hence, customers may be less likely to cancel when they make bookings through agents or companies.
# customer_type histogram and analysis
levels(data.df.one$customer_type)
## [1] "Contract" "Group" "Transient" "Transient-Party"
customer_type_cancellations <- data.df.one %>%
select(is_canceled, customer_type) %>%
filter(is_canceled == 1) %>%
group_by(customer_type) %>%
summarize(cancellations = n(), .groups = 'drop')
customer_type_bookings <- data.df.one %>%
select(is_canceled, customer_type) %>%
group_by(customer_type) %>%
summarize(bookings = n(), .groups = 'drop')
# Join monthly data on month
customer_type.df <- full_join(customer_type_bookings, customer_type_cancellations, by = "customer_type") %>%
mutate(cancel_rate = cancellations/bookings) %>%
arrange(desc(cancel_rate))
customer_type.chart <- ggplot(customer_type.df, aes( y=cancel_rate, reorder(x=customer_type, -cancel_rate), fill=as.factor(customer_type))) +
geom_bar( stat="identity") +
scale_fill_manual(values = brewer.pal(n=4, name="Set3")) +
labs(y = "Cancel Rate", x = "Customer Type", title = "Cancel Rate by Customer Type")
customer_type.df;
## # A tibble: 4 x 4
## customer_type bookings cancellations cancel_rate
## <fct> <int> <int> <dbl>
## 1 Transient 89613 36514 0.407
## 2 Contract 4076 1262 0.310
## 3 Transient-Party 25120 6385 0.254
## 4 Group 577 59 0.102
customer_type.chart
Cancellation rate varies depending on the customer type. Contract, group, and transient-parties are less likely to cancel than average customers, whereas transient customers are more likely to cancel. The majority of bookings are made by transient customers which explains why it has such a large impact on the mean cancellation rate.
Cancellations are an imminent threat in the travel and tourism industry. To survive in the competitive industry, hotels must utilize cancellation policies and tactful overbooking strategies. A major hotel chain enlisted our help in providing advice on creating the most efficient and profitable overbooking strategy in order to remain competitive in the industry. To determine the best overbooking strategy we analyzed comprehensive hotel booking data from 2015 - 2017. Our strategy involved exploring relationships between individual variables and the cancellation rate to determine which factors have the largest influence on booking cancellations.
The analysis provided numerous major insights. The average cancellation rate of 37% was used as a baseline for comparing results. A significant difference was found between the cancellation rate of city hotels and resort hotels. Approximately 42% of bookings at city hotels were cancelled, whereas only 28% of bookings at resort hotels were cancelled. Additionally, we found cancellations are highest from April - June. A negative relationship was observed between lead time and the cancellation rate. The farther in advance a booking is made, the more likely it is to be cancelled. Customer history and characteristics can also be used to determine cancellation risk. The cancellation rate for repeat guests is approximately 23% lower than the average cancellation rate, and guests with previous cancellations have an extremely high cancellation rate of 91.6%. Additionally, being placed on the waiting list also greatly increases the likelihood of a booking being cancelled, and customers with special requests are less likely to cancel their bookings. A negative relationship was found between the average daily rate and the cancellation rate. Bookings with very low (under $90) average daily rates had much lower cancellation rates than average. Finally, customers are also 8% less likely to cancel their bookings when they book through a travel agency/company.
These findings provide insight on which variables our client should consider in crafting their overbooking strategy. More aggressive overbooking strategies should be utilized by city hotels, during the springtime, for bookings with long lead times, for guests with previous cancelations, and for guests placed on waiting lists. Less aggressive overbooking strategies should be utilized by resort hotels, for repeat guests, for guests with special requests, for bookings with extremely low ADR’s, and for bookings made through a travel agency/company. These factors are unlikely to align with each other, hence hotels must assess and weigh the importance of each of these factors in determining cancellation risk. Overall, assessing hotel type, customer history and characteristics, and booking characteristics is a strategic tool hotels can use to create optimal overbooking strategies to remain competitive in the industry.
A major limitation of our analysis is our examination of correlation rather than causation. We examined the impact of various variables on the cancellation rate to see if there was a relationship between the two variables, however we did not assess whether specific variables truly caused an increase or decrease in the cancellation rate. Some variables might decrease or increase as a result of the cancellation rate changing, indicating reverse causality. The relationship between nonrefundable deposits and the cancellation rate are one example of this. Higher cancellation rates are more likely to result in the use of nonrefundable deposits rather than nonrefundable deposits leading to higher cancellation rates. Another limitation of our analysis was our inability to measure multicollinearity between variables. We analyzed the impact of numerous independent variables on the cancellation rate, however because we did not fit any models to our data or perform a residual analysis it is impossible to determine whether these variables are truly independent of one another. Many of these variables may cause other independent variables to change, further challenging the significance of our results. Future research should examine causation between various variables and the cancellation rate, and fit models to the data and perform tests for multicollinearity.