Introduction

In 2019 the travel and tourism sector accounted for over 10% of the world’s GDP and 1 in 10 jobs across the globe making it one of the largest industries in the world. The industry grew 3.5% in 2019 and is expected to continue growing over the next 10 years [(World Travel and Tourism Council, 2019)] (https://wttc.org/Research/Economic-Impact). As the industry continues to expand, so does competition. Hotel firms are engaged in an arduous battle to cut costs and create more value for guests than competing firms. One of the most imminent barriers to boasting profits and outperforming rivals are cancellations. Booking cancellations are one of the biggest threats to the hospitality industry. To protect against the threat of cancellations, hotels utilize cancellation policies and overbooking tactics. However, if used incorrectly overbooking strategies can produce detrimental effects and undermine a firm’s market position. Hotels often overbook based on the assumption that a certain percentage of guests will cancel their reservations. Overbooking protects hotels from losses due to operating below capacity, however it also creates additional risk. If there are fewer cancellations than forecasted, hotels will be unable to provide rooms for guests which can result in the loss of loyal customers and brand tarnishment [(Antonio et al., 2017)] (https://ieeexplore.ieee.org/document/8260781). Determining the optimal rate of overbooking serves the best interests of both hotel firms and consumers by enabling firms to maximize profits while adequately satisfying consumer needs. Effective overbooking strategies are also critical to staying competitive in the industry. To navigate the growing competition in the travel and tourism industry, a major hotel corporation has requested our services to determine the optimal rate of overbooking in order to minimize risk and maximize profits.

To determine the optimal overbooking rate for our client we will analyze comprehensive hotel booking data collected from 2015-2017. We will evaluate numerous variables to determine their relationship with the cancellation rate variable. A variety of approaches and analytic techniques will be utilized to analyze and make sense of the data. Techniques include exploratory data analysis, analysis of the variance, and linear regression. Gaining insight on the relationship of different variables with the cancellation rate variable will enable us to advise our client on the factors that must be considered when determining the ideal overbooking rate. Overall this will enable our client to utilize the most efficient cancellation rate to maximize profits, minimize risk, and remain competitive in the industry.

Packages

Packages Used

  • tidyverse = allows for data manipulation and works in harmony with other packages as well
  • ggplot2 = graphical representation in r
  • dplyr = data manipulation in r
  • library(DT) = used for displaying R data objects (matrices or data frames) as tables on HTML page
  • kableExtra =
  • threadr = threadr package was added and used to create uuid for each data entry. UUID is the Universally Unique Identifier.
  • RColorBrewer = color palette for ggplot

Data Preparation

Data Source

Our data was obtained from Antonio, Almeida, and Nunes (2019)

setwd("C:/Users/messe/Documents/Masters/Fall 2020/Data Wrangling/Project");
data.df <- read.csv("C:/Users/messe/Documents/Masters/Fall 2020/Data Wrangling/Project/hotels.csv", stringsAsFactors = FALSE)
options(warn = -1)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RColorBrewer)
#install.packages("threadr")
library(threadr)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## 
## Attaching package: 'threadr'
## The following objects are masked from 'package:readr':
## 
##     parse_time, read_lines, type_convert

Explanation of Data

The data was originally collected for the role of research in various fields such as education in revenue management, machine learning, and data mining. The data is a culmination of two data sets - one on a resort hotel and the other on a city hotel. In total the data has 31 variables, with 119,390 observations. In this context, observations represent hotel bookings. The data was collected from July of 2015 to August 2017. The data appears to be relatively clean. However, different variables use inconsistent naming conventions to classify missing values. Missing values are classified as NA, undefined, and Null throughout the data. We will utilize data cleaning techniques to uniformly classify missing values and make a strategic decision on whether to omit, change, or keep missing values.

Data Cleaning Steps

We will clean the data to create consistency and ensure it can be thoroughly and efficiently analyzed.

To clean the data we will: * Ensure variables names follow a proper, uniform naming convention + Snake case was selected as the designated naming convention + Source data was received in selected naming convention

  • Replaced missing and NA values and consolidated categorical variables with similar meaning for the following variables:
    • meal: “Undefined” and “SC” were consolidated into a “none” category because the data dictionary defines these categories at the same.
data.df$meal <- replace(data.df$meal, data.df$meal=="Undefined", "none");
data.df$meal <- replace(data.df$meal, data.df$meal=="SC", "none")
  • children and lead_time had missing or null values: “NA” and missing values were removed because there were only 4 columns effected.
data.df %>%
head(10) #Dataset with missing or null values 119,390 values
##           hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1  Resort Hotel           0       342              2015               July
## 2  Resort Hotel           0       737              2015               July
## 3  Resort Hotel           0         7              2015               July
## 4  Resort Hotel           0        13              2015               July
## 5  Resort Hotel           0        14              2015               July
## 6  Resort Hotel           0        14              2015               July
## 7  Resort Hotel           0         0              2015               July
## 8  Resort Hotel           0         9              2015               July
## 9  Resort Hotel           1        85              2015               July
## 10 Resort Hotel           1        75              2015               July
##    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                        27                         1                       0
## 2                        27                         1                       0
## 3                        27                         1                       0
## 4                        27                         1                       0
## 5                        27                         1                       0
## 6                        27                         1                       0
## 7                        27                         1                       0
## 8                        27                         1                       0
## 9                        27                         1                       0
## 10                       27                         1                       0
##    stays_in_week_nights adults children babies meal country market_segment
## 1                     0      2        0      0   BB     PRT         Direct
## 2                     0      2        0      0   BB     PRT         Direct
## 3                     1      1        0      0   BB     GBR         Direct
## 4                     1      1        0      0   BB     GBR      Corporate
## 5                     2      2        0      0   BB     GBR      Online TA
## 6                     2      2        0      0   BB     GBR      Online TA
## 7                     2      2        0      0   BB     PRT         Direct
## 8                     2      2        0      0   FB     PRT         Direct
## 9                     3      2        0      0   BB     PRT      Online TA
## 10                    3      2        0      0   HB     PRT  Offline TA/TO
##    distribution_channel is_repeated_guest previous_cancellations
## 1                Direct                 0                      0
## 2                Direct                 0                      0
## 3                Direct                 0                      0
## 4             Corporate                 0                      0
## 5                 TA/TO                 0                      0
## 6                 TA/TO                 0                      0
## 7                Direct                 0                      0
## 8                Direct                 0                      0
## 9                 TA/TO                 0                      0
## 10                TA/TO                 0                      0
##    previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                               0                  C                  C
## 2                               0                  C                  C
## 3                               0                  A                  C
## 4                               0                  A                  A
## 5                               0                  A                  A
## 6                               0                  A                  A
## 7                               0                  C                  C
## 8                               0                  C                  C
## 9                               0                  A                  A
## 10                              0                  D                  D
##    booking_changes deposit_type agent company days_in_waiting_list
## 1                3   No Deposit  NULL    NULL                    0
## 2                4   No Deposit  NULL    NULL                    0
## 3                0   No Deposit  NULL    NULL                    0
## 4                0   No Deposit   304    NULL                    0
## 5                0   No Deposit   240    NULL                    0
## 6                0   No Deposit   240    NULL                    0
## 7                0   No Deposit  NULL    NULL                    0
## 8                0   No Deposit   303    NULL                    0
## 9                0   No Deposit   240    NULL                    0
## 10               0   No Deposit    15    NULL                    0
##    customer_type   adr required_car_parking_spaces total_of_special_requests
## 1      Transient   0.0                           0                         0
## 2      Transient   0.0                           0                         0
## 3      Transient  75.0                           0                         0
## 4      Transient  75.0                           0                         0
## 5      Transient  98.0                           0                         1
## 6      Transient  98.0                           0                         1
## 7      Transient 107.0                           0                         0
## 8      Transient 103.0                           0                         1
## 9      Transient  82.0                           0                         1
## 10     Transient 105.5                           0                         0
##    reservation_status reservation_status_date
## 1           Check-Out                7/1/2015
## 2           Check-Out                7/1/2015
## 3           Check-Out                7/2/2015
## 4           Check-Out                7/2/2015
## 5           Check-Out                7/3/2015
## 6           Check-Out                7/3/2015
## 7           Check-Out                7/3/2015
## 8           Check-Out                7/3/2015
## 9            Canceled                5/6/2015
## 10           Canceled               4/22/2015
data.df <- na.omit(data.df) #New dataset with 119,386 values
  • Two variables were removed because this information is irrelevant in predicting cancellations. This is because the data in these variables is recorded after the cancellation occurs.
  • These variables are:
    • reservation_status and reservation_status_date
data.df <- data.df[-c(31,32)]
  • The category “Undefined” in market_segment and distribution_channel were changed to “none”

  • distribution_channel was removed to avoid correlation between market_segment and distribution_channel in our analysis.

data.df <- data.df[-c(16)]
  • arrival_date_week_number was removed because the information provided by this variable is provided by arrival_date_day_of_month and arrival_date_month
data.df <- data.df[-c(6)]
  • Removed the required_car_parking_spaces because we are not able to determine if each hotel provides parking. Additionally, median and 3rd quartile values are 0 so, we do not expect many irregularities.
data.df <- data.df[-c(27)]
  • Null values were removed from the variables agent and company and replaced with “none”
  • Numeric values with “Na” were replaced with 0
data.df$agent <- replace(data.df$agent, data.df$agent=="NULL","none");
data.df$company <- replace(data.df$company, data.df$company=="NULL","none")
  • Categorical data was transformed into factor variables.
  • These variables were designated as categorical:
  • hotel, arrival_date_year, arrival_date_month, meal, country, market_segment, distribution_channel, reserved_room_type, assigned_room_type, booking_changes, deposit_type, agent, company, customer_type, reservation_status
data.df.one <- data.df
data.df.one$hotel <- as.factor(data.df$hotel);
data.df.one$arrival_date_year <- as.factor(data.df$arrival_date_year);
data.df.one$arrival_date_month <- as.factor(data.df$arrival_date_month);
data.df.one$meal <- as.factor(data.df$meal);
data.df.one$country <- as.factor(data.df$country);
data.df.one$market_segment <- as.factor(data.df$market_segment);
data.df.one$reserved_room_type <- as.factor(data.df$reserved_room_type);
data.df.one$assigned_room_type <- as.factor(data.df$assigned_room_type);
data.df.one$booking_changes <- as.factor(data.df$booking_changes);
data.df.one$deposit_type <- as.factor(data.df$deposit_type);
data.df.one$agent <- as.factor(data.df$agent);
data.df.one$company <- as.factor(data.df$company);
data.df.one$customer_type <- as.factor(data.df$customer_type)

# Add unique row ID
data.df.one$ID <- uuid_rowwise(data.df.one)
  • Overall average cancellation rate is 0.3704 or 37.04 percent
# Translate numeric variables into categorical data based on quantiles

data.df.one %>%
  summary()
##           hotel        is_canceled       lead_time   arrival_date_year
##  City Hotel  :79326   Min.   :0.0000   Min.   :  0   2015:21992       
##  Resort Hotel:40060   1st Qu.:0.0000   1st Qu.: 18   2016:56707       
##                       Median :0.0000   Median : 69   2017:40687       
##                       Mean   :0.3704   Mean   :104                    
##                       3rd Qu.:1.0000   3rd Qu.:160                    
##                       Max.   :1.0000   Max.   :737                    
##                                                                       
##  arrival_date_month arrival_date_day_of_month stays_in_weekend_nights
##  August :13873      Min.   : 1.0              Min.   : 0.0000        
##  July   :12661      1st Qu.: 8.0              1st Qu.: 0.0000        
##  May    :11791      Median :16.0              Median : 1.0000        
##  October:11160      Mean   :15.8              Mean   : 0.9276        
##  April  :11089      3rd Qu.:23.0              3rd Qu.: 2.0000        
##  June   :10939      Max.   :31.0              Max.   :19.0000        
##  (Other):47873                                                       
##  stays_in_week_nights     adults          children           babies         
##  Min.   : 0.0         Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000  
##  1st Qu.: 1.0         1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000  
##  Median : 2.0         Median : 2.000   Median : 0.0000   Median : 0.000000  
##  Mean   : 2.5         Mean   : 1.856   Mean   : 0.1039   Mean   : 0.007949  
##  3rd Qu.: 3.0         3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000  
##  Max.   :50.0         Max.   :55.000   Max.   :10.0000   Max.   :10.000000  
##                                                                             
##    meal          country            market_segment  is_repeated_guest
##  BB  :92306   PRT    :48586   Aviation     :  237   Min.   :0.00000  
##  FB  :  798   GBR    :12129   Complementary:  743   1st Qu.:0.00000  
##  HB  :14463   FRA    :10415   Corporate    : 5295   Median :0.00000  
##  none:11819   ESP    : 8568   Direct       :12605   Mean   :0.03191  
##               DEU    : 7287   Groups       :19811   3rd Qu.:0.00000  
##               ITA    : 3766   Offline TA/TO:24219   Max.   :1.00000  
##               (Other):28635   Online TA    :56476                    
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1114     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes      deposit_type        agent      
##  A      :74053      0      :101310   No Deposit:104637   9      :31960  
##  D      :25322      1      : 12701   Non Refund: 14587   none   :16338  
##  E      : 7806      2      :  3805   Refundable:   162   240    :13922  
##  F      : 3751      3      :   927                       1      : 7191  
##  G      : 2553      4      :   376                       14     : 3639  
##  C      : 2375      5      :   118                       7      : 3539  
##  (Other): 3526      (Other):   149                       (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  none   :112589   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89613  
##  67     :   267   Mean   :  2.321      Transient-Party:25120  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          total_of_special_requests      ID           
##  Min.   :  -6.38   Min.   :0.0000            Length:119386     
##  1st Qu.:  69.29   1st Qu.:0.0000            Class :character  
##  Median :  94.59   Median :0.0000            Mode  :character  
##  Mean   : 101.83   Mean   :0.5713                              
##  3rd Qu.: 126.00   3rd Qu.:1.0000                              
##  Max.   :5400.00   Max.   :5.0000                              
## 
# Identify numeric data
data.df.one %>%
  select(where(is.numeric)) %>%  
  head(10)
##    is_canceled lead_time arrival_date_day_of_month stays_in_weekend_nights
## 1            0       342                         1                       0
## 2            0       737                         1                       0
## 3            0         7                         1                       0
## 4            0        13                         1                       0
## 5            0        14                         1                       0
## 6            0        14                         1                       0
## 7            0         0                         1                       0
## 8            0         9                         1                       0
## 9            1        85                         1                       0
## 10           1        75                         1                       0
##    stays_in_week_nights adults children babies is_repeated_guest
## 1                     0      2        0      0                 0
## 2                     0      2        0      0                 0
## 3                     1      1        0      0                 0
## 4                     1      1        0      0                 0
## 5                     2      2        0      0                 0
## 6                     2      2        0      0                 0
## 7                     2      2        0      0                 0
## 8                     2      2        0      0                 0
## 9                     3      2        0      0                 0
## 10                    3      2        0      0                 0
##    previous_cancellations previous_bookings_not_canceled days_in_waiting_list
## 1                       0                              0                    0
## 2                       0                              0                    0
## 3                       0                              0                    0
## 4                       0                              0                    0
## 5                       0                              0                    0
## 6                       0                              0                    0
## 7                       0                              0                    0
## 8                       0                              0                    0
## 9                       0                              0                    0
## 10                      0                              0                    0
##      adr total_of_special_requests
## 1    0.0                         0
## 2    0.0                         0
## 3   75.0                         0
## 4   75.0                         0
## 5   98.0                         1
## 6   98.0                         1
## 7  107.0                         0
## 8  103.0                         1
## 9   82.0                         1
## 10 105.5                         0
# Identify Categorical data
data.df.one %>%
  select(where(is.factor)) %>%
  head(10)
##           hotel arrival_date_year arrival_date_month meal country
## 1  Resort Hotel              2015               July   BB     PRT
## 2  Resort Hotel              2015               July   BB     PRT
## 3  Resort Hotel              2015               July   BB     GBR
## 4  Resort Hotel              2015               July   BB     GBR
## 5  Resort Hotel              2015               July   BB     GBR
## 6  Resort Hotel              2015               July   BB     GBR
## 7  Resort Hotel              2015               July   BB     PRT
## 8  Resort Hotel              2015               July   FB     PRT
## 9  Resort Hotel              2015               July   BB     PRT
## 10 Resort Hotel              2015               July   HB     PRT
##    market_segment reserved_room_type assigned_room_type booking_changes
## 1          Direct                  C                  C               3
## 2          Direct                  C                  C               4
## 3          Direct                  A                  C               0
## 4       Corporate                  A                  A               0
## 5       Online TA                  A                  A               0
## 6       Online TA                  A                  A               0
## 7          Direct                  C                  C               0
## 8          Direct                  C                  C               0
## 9       Online TA                  A                  A               0
## 10  Offline TA/TO                  D                  D               0
##    deposit_type agent company customer_type
## 1    No Deposit  none    none     Transient
## 2    No Deposit  none    none     Transient
## 3    No Deposit  none    none     Transient
## 4    No Deposit   304    none     Transient
## 5    No Deposit   240    none     Transient
## 6    No Deposit   240    none     Transient
## 7    No Deposit  none    none     Transient
## 8    No Deposit   303    none     Transient
## 9    No Deposit   240    none     Transient
## 10   No Deposit    15    none     Transient
# Number of Datapoints and number of variables
dim_type <- c("datapoints","variables")
tibble(dim_type, dim(data.df.one))
## # A tibble: 2 x 2
##   dim_type   `dim(data.df.one)`
##   <chr>                   <int>
## 1 datapoints             119386
## 2 variables                  28
# Get quantiles for each numeric variable
#arrival_date_day_of_month
quantile(data.df.one$arrival_date_day_of_month)
##   0%  25%  50%  75% 100% 
##    1    8   16   23   31
#stays_in_weekend_nights
quantile(data.df.one$stays_in_weekend_nights)
##   0%  25%  50%  75% 100% 
##    0    0    1    2   19
#stays_in_week_nights
quantile(data.df.one$stays_in_week_nights)
##   0%  25%  50%  75% 100% 
##    0    1    2    3   50
#adults
quantile(data.df.one$adults)
##   0%  25%  50%  75% 100% 
##    0    2    2    2   55
#children
quantile(data.df.one$children, na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##    0    0    0    0   10
#babies
quantile(data.df.one$babies)
##   0%  25%  50%  75% 100% 
##    0    0    0    0   10
#is_repeated_guest
quantile(data.df.one$is_repeated_guest)
##   0%  25%  50%  75% 100% 
##    0    0    0    0    1
#previous_cancellations
quantile(data.df.one$previous_cancellations)
##   0%  25%  50%  75% 100% 
##    0    0    0    0   26
#previous_bookings_not_canceled
quantile(data.df.one$previous_bookings_not_canceled)
##   0%  25%  50%  75% 100% 
##    0    0    0    0   72
#days_in_waiting_list
quantile(data.df.one$days_in_waiting_list)
##   0%  25%  50%  75% 100% 
##    0    0    0    0  391
#total_of_special_requests
quantile(data.df.one$total_of_special_requests)
##   0%  25%  50%  75% 100% 
##    0    0    0    1    5

Exploratory Data Analysis & Visualizations

We will discover information in the data that is not self-evident by scrutinizing the impact of various individual variables on the cancellation rate. We separate the data in to quantiles to account for outliers and enable a closer examination of each variable. Our visualizations are especially helpful in depicting these relationships.

# Number of cancellations per hotel type
x1 <- data.df.one %>%
        select(is_canceled, hotel) %>% 
        group_by(hotel) %>%
        filter(is_canceled == 1) %>%
        summarise(no_rows = length(hotel), .groups = 'drop') %>%
        spread(hotel, no_rows)

colnames(x1) <- paste(colnames(x1), "canceled", sep = "_");

# Number of non-cancellations per hotel type
x2 <- data.df.one %>%
        select(is_canceled, hotel) %>% 
        group_by(hotel) %>%
        filter(is_canceled == 0) %>%
        summarise(no_rows = length(hotel), .groups = 'drop') %>%
        spread(hotel, no_rows) 

colnames(x2) <- paste(colnames(x2), "not_canceled", sep = "_");

# Combine data sets
hotel_type_cancellations <- tibble(x1,x2);

# Function for calculating percent cancellation per type
percent_cancellation <- function(x,y) {
  n <- (sum(x)+sum(y))
  (sum(x)/n)
}

# Function for calculating percent cancellation per type
population_size_n <- function(x,y) {
  (sum(x)+sum(y))
}

# Run percent_cancellation to determine percent cancellation of City Hotels
percent_cancellation(hotel_type_cancellations$`City Hotel_canceled`,hotel_type_cancellations$`City Hotel_not_canceled`)
## [1] 0.4172402
# Run population_size_n to determine the total population of each category
population_size_n(hotel_type_cancellations$`City Hotel_canceled`,hotel_type_cancellations$`City Hotel_not_canceled`)
## [1] 79326
# Run percent_cancellation to determine percent cancellation of Resort Hotels
percent_cancellation(hotel_type_cancellations$`Resort Hotel_canceled`,hotel_type_cancellations$`Resort Hotel_not_canceled`)
## [1] 0.2776335
# Run population_size_n to determine the total population of each category
population_size_n(hotel_type_cancellations$`Resort Hotel_canceled`,hotel_type_cancellations$`Resort Hotel_not_canceled`)
## [1] 40060

Approximately, 28% of bookings at resort hotels were cancelled, whereas 42% of bookings at city hotels were cancelled. Typically people stay at resort hotels for vacations and at city hotels for quick weekend getaways or business trips. The discrepancy in cancellation rates between the two variables may be due to the reluctance of people to cancel vacations that have been planned far in advance.

Graphics

data.df.one %>%
  select(where(is.factor)) %>%
  head(10)
##           hotel arrival_date_year arrival_date_month meal country
## 1  Resort Hotel              2015               July   BB     PRT
## 2  Resort Hotel              2015               July   BB     PRT
## 3  Resort Hotel              2015               July   BB     GBR
## 4  Resort Hotel              2015               July   BB     GBR
## 5  Resort Hotel              2015               July   BB     GBR
## 6  Resort Hotel              2015               July   BB     GBR
## 7  Resort Hotel              2015               July   BB     PRT
## 8  Resort Hotel              2015               July   FB     PRT
## 9  Resort Hotel              2015               July   BB     PRT
## 10 Resort Hotel              2015               July   HB     PRT
##    market_segment reserved_room_type assigned_room_type booking_changes
## 1          Direct                  C                  C               3
## 2          Direct                  C                  C               4
## 3          Direct                  A                  C               0
## 4       Corporate                  A                  A               0
## 5       Online TA                  A                  A               0
## 6       Online TA                  A                  A               0
## 7          Direct                  C                  C               0
## 8          Direct                  C                  C               0
## 9       Online TA                  A                  A               0
## 10  Offline TA/TO                  D                  D               0
##    deposit_type agent company customer_type
## 1    No Deposit  none    none     Transient
## 2    No Deposit  none    none     Transient
## 3    No Deposit  none    none     Transient
## 4    No Deposit   304    none     Transient
## 5    No Deposit   240    none     Transient
## 6    No Deposit   240    none     Transient
## 7    No Deposit  none    none     Transient
## 8    No Deposit   303    none     Transient
## 9    No Deposit   240    none     Transient
## 10   No Deposit    15    none     Transient
monthly_cancellations <- data.df.one %>%
  select(is_canceled, arrival_date_month) %>%
  filter(is_canceled == 1) %>%
  group_by(arrival_date_month) %>%
  summarize(cancellations = n(), .groups = 'drop')

monthly_bookings <- data.df.one %>%
  select(is_canceled, arrival_date_month) %>%
  group_by(arrival_date_month) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
monthly_cancellations.df <- full_join(monthly_bookings, monthly_cancellations, by = "arrival_date_month") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  mutate(month_number = ifelse(arrival_date_month == "January",1,
                               ifelse(arrival_date_month == "February",2,
                               ifelse(arrival_date_month == "March",3,
                               ifelse(arrival_date_month == "April",4,
                               ifelse(arrival_date_month == "May",5,
                               ifelse(arrival_date_month == "June",6,
                               ifelse(arrival_date_month == "July",7,
                               ifelse(arrival_date_month == "August",8,
                               ifelse(arrival_date_month == "September",9,
                               ifelse(arrival_date_month == "October", 10,
                               ifelse(arrival_date_month == "November",11,
                               12)))))))))))) %>%
  arrange((month_number))



monthly_cancellations.chart <- ggplot(monthly_cancellations.df, aes(y=cancel_rate, reorder(x=arrival_date_month, month_number), fill=arrival_date_month)) + 
    geom_bar( stat="identity") +
    scale_x_discrete(labels=c("Jan", "Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
   scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
  labs(y = "Cancel Rate", x = "Arrival Month", title = "Cancel Rate by Arrival Month") 

monthly_cancellations.df;
## # A tibble: 12 x 5
##    arrival_date_month bookings cancellations cancel_rate month_number
##    <fct>                 <int>         <int>       <dbl>        <dbl>
##  1 January                5929          1807       0.305            1
##  2 February               8068          2696       0.334            2
##  3 March                  9794          3149       0.322            3
##  4 April                 11089          4524       0.408            4
##  5 May                   11791          4677       0.397            5
##  6 June                  10939          4535       0.415            6
##  7 July                  12661          4742       0.375            7
##  8 August                13873          5235       0.377            8
##  9 September             10508          4116       0.392            9
## 10 October               11160          4246       0.380           10
## 11 November               6794          2122       0.312           11
## 12 December               6780          2371       0.350           12
monthly_cancellations.chart;

Cancellations fluctuate based on the season. Cancellations are highest in the spring and early summer and increase again in late summer.

meal_cancellations <- data.df.one %>%
  select(is_canceled, meal) %>%
  filter(is_canceled == 1) %>%
  group_by(meal) %>%
  summarize(cancellations = n(), .groups = 'drop')

meal_bookings <- data.df.one %>%
  select(is_canceled, meal) %>%
  group_by(meal) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
meal_cancellations.df <- full_join(meal_bookings, meal_cancellations, by = "meal") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

meal_cancellations.chart <- ggplot(meal_cancellations.df, aes( y=cancel_rate, reorder(x=meal, -cancel_rate), fill =meal)) + 
    geom_bar( stat="identity") +
    scale_x_discrete(labels=c("Full Board", "Bed & Breakfast", "None", "Half Board")) +
    scale_fill_manual(values = brewer.pal(n=4,name="Set3")) +
    labs(y = "Cancel Rate", x = "Meal Type ", title = "Cancel Rate by Meal Type") 

meal_cancellations.chart;

meal_cancellations.df;
## # A tibble: 4 x 4
##   meal  bookings cancellations cancel_rate
##   <fct>    <int>         <int>       <dbl>
## 1 FB         798           478       0.599
## 2 BB       92306         34506       0.374
## 3 none     11819          4252       0.360
## 4 HB       14463          4984       0.345

Cancellation rates are significantly higher (59.9%) for hotels serving full board breakfasts. Bookings are significantly smaller for hotels with full board breakfasts so the significance of this finding is questionable.

arrival_date_year_cancellations <- data.df.one %>%
  select(is_canceled, arrival_date_year) %>%
  filter(is_canceled == 1) %>%
  group_by(arrival_date_year) %>%
  summarize(cancellations = n(), .groups = 'drop')

arrival_date_year_bookings <- data.df.one %>%
  select(is_canceled, arrival_date_year) %>%
  group_by(arrival_date_year) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
arrival_date_year_cancellations.df <- full_join(arrival_date_year_bookings, arrival_date_year_cancellations, by = "arrival_date_year") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(arrival_date_year)) %>%
  mutate(arrival_date_year_int = as.integer(arrival_date_year))

arrival_date_year_cancellations.chart <- ggplot(arrival_date_year_cancellations.df, aes( y=cancel_rate, reorder(x=arrival_date_year, arrival_date_year_int), fill=arrival_date_year)) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=3,name="Set3")) +
    labs(y = "Cancel Rate", x = "Arrival Year", title = "Cancel Rate by Arrival Year") 

arrival_date_year_cancellations.chart;

arrival_date_year_cancellations.df;
## # A tibble: 3 x 5
##   arrival_date_year bookings cancellations cancel_rate arrival_date_year_int
##   <fct>                <int>         <int>       <dbl>                 <int>
## 1 2017                 40687         15745       0.387                     3
## 2 2016                 56707         20337       0.359                     2
## 3 2015                 21992          8138       0.370                     1

Bookings trended upwards in 2017 indicating people may be booking more due to the availability of other options such as AirBnB’s. Further research should examine trends in cancellation rates in the last 5 years.

country_cancellations <- data.df.one %>%
  select(is_canceled, country) %>%
  filter(is_canceled == 1) %>%
  group_by(country) %>%
  summarize(cancellations = n(), .groups = 'drop')

country_bookings <- data.df.one %>%
  select(is_canceled, country) %>%
  group_by(country) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
country_cancellations.df <- full_join(country_bookings, country_cancellations, by = "country") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(bookings)) 

country_cancellations.df
## # A tibble: 178 x 4
##    country bookings cancellations cancel_rate
##    <fct>      <int>         <int>       <dbl>
##  1 PRT        48586         27515       0.566
##  2 GBR        12129          2453       0.202
##  3 FRA        10415          1934       0.186
##  4 ESP         8568          2177       0.254
##  5 DEU         7287          1218       0.167
##  6 ITA         3766          1333       0.354
##  7 IRL         3375           832       0.247
##  8 BEL         2342           474       0.202
##  9 BRA         2224           830       0.373
## 10 NLD         2104           387       0.184
## # ... with 168 more rows
# Include only countries in 90th percentile of total bookings (>644)
quantile(country_cancellations.df$bookings,.90)
##   90% 
## 643.1
country_cancellations_90.df <- country_cancellations.df %>%
  filter(bookings >= 644)
  
country_cancellations_90.df;
## # A tibble: 18 x 4
##    country bookings cancellations cancel_rate
##    <fct>      <int>         <int>       <dbl>
##  1 PRT        48586         27515       0.566
##  2 GBR        12129          2453       0.202
##  3 FRA        10415          1934       0.186
##  4 ESP         8568          2177       0.254
##  5 DEU         7287          1218       0.167
##  6 ITA         3766          1333       0.354
##  7 IRL         3375           832       0.247
##  8 BEL         2342           474       0.202
##  9 BRA         2224           830       0.373
## 10 NLD         2104           387       0.184
## 11 USA         2097           501       0.239
## 12 CHE         1730           428       0.247
## 13 CN          1279           254       0.199
## 14 AUT         1263           230       0.182
## 15 SWE         1024           227       0.222
## 16 CHN          999           462       0.462
## 17 POL          919           215       0.234
## 18 ISR          669           169       0.253
colourCount = length(unique(country_cancellations_90.df$country))
 getPalette = colorRampPalette(brewer.pal(n=18, name="Set3"))
 
country_cancellations_90.chart <- ggplot(country_cancellations_90.df, aes( y=cancel_rate, reorder(x=country, -cancel_rate), fill=country)) + 
    geom_bar(stat="identity") +
    labs(y = "Cancel Rate", x = "Country", title = "Cancel Rate by Country") 

country_cancellations_90.chart;

# Makes sense that China has a high cancellation rate becasue, in oreder to ger a visa, you have to have already decided where you are going to stay. Further, it is very common to book a hotel and cancel upon arrival.

Geographical and cultural norms may account for differences in cancellation rates across the globe. Portugal and China have cancellation rates of 57% and 46%, respectively which are the highest cancellation rates among the 18 countries observed. Both of these countries require visitors have a hotel reservations to get a visa meaning after visitors obtain a Visa they cancel their hotel bookings and make different staying arrangements. Differences in the number of bookings for each country and cultural norms and practices may also attribute for different cancellation rates throughout the world.

market_segment_cancellations <- data.df.one %>%
  select(is_canceled, market_segment) %>%
  filter(is_canceled == 1) %>%
  group_by(market_segment) %>%
  summarize(cancellations = n(), .groups = 'drop')

market_segment_bookings <- data.df.one %>%
  select(is_canceled, market_segment) %>%
  group_by(market_segment) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
market_segment_cancellations.df <- full_join(market_segment_bookings, market_segment_cancellations, by = "market_segment") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

market_segment_cancellations.chart <- ggplot(na.omit(market_segment_cancellations.df),aes( y=cancel_rate, reorder(x=market_segment, -cancel_rate), fill=market_segment)) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=8,name="Set3")) +
    labs(y = "Cancel Rate", x = "Market Segment", title = "Cancel Rate by Market Segment")
   

  
market_segment_cancellations.chart;

market_segment_cancellations.df;
## # A tibble: 7 x 4
##   market_segment bookings cancellations cancel_rate
##   <fct>             <int>         <int>       <dbl>
## 1 Groups            19811         12097       0.611
## 2 Online TA         56476         20738       0.367
## 3 Offline TA/TO     24219          8311       0.343
## 4 Aviation            237            52       0.219
## 5 Corporate          5295           992       0.187
## 6 Direct            12605          1933       0.153
## 7 Complementary       743            97       0.131

Here we see that complementary trips are very unlikely to be cancelled. Additionally, we see that groups have a high cancellations rate. These findings are backed up by later findings that show, the more people in a reservation, the more likely to be cancelled. Bookings made online also have a relatively high cancel rate. This is likely due to the ease of making these plans and ease of canelling them.

reserved_room_type_cancellations <- data.df.one %>%
  select(is_canceled, reserved_room_type) %>%
  filter(is_canceled == 1) %>%
  group_by(reserved_room_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

reserved_room_type_bookings <- data.df.one %>%
  select(is_canceled, reserved_room_type) %>%
  group_by(reserved_room_type) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
reserved_room_type_cancellations.df <- full_join(reserved_room_type_bookings, reserved_room_type_cancellations, by = "reserved_room_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

reserved_room_type_cancellations.chart <- ggplot(reserved_room_type_cancellations.df, aes( y=cancel_rate, reorder(x=reserved_room_type, -cancel_rate), fill=reserved_room_type)) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=10,name="Set3")) +
    labs(y = "Cancel Rate", x = "Reserved Room Type", title = "Cancel Rate by Reserved Room Type")

reserved_room_type_cancellations.df;
## # A tibble: 10 x 4
##    reserved_room_type bookings cancellations cancel_rate
##    <fct>                 <int>         <int>       <dbl>
##  1 P                        12            12       1    
##  2 H                       601           245       0.408
##  3 A                     85994         33630       0.391
##  4 G                      2094           763       0.364
##  5 L                         6             2       0.333
##  6 C                       932           308       0.330
##  7 B                      1114           364       0.327
##  8 D                     19201          6102       0.318
##  9 F                      2897           880       0.304
## 10 E                      6535          1914       0.293
reserved_room_type_cancellations.chart;

assigned_room_type_cancellations <- data.df.one %>%
  select(is_canceled, assigned_room_type) %>%
  filter(is_canceled == 1) %>%
  group_by(assigned_room_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

assigned_room_type_bookings <- data.df.one %>%
  select(is_canceled, assigned_room_type) %>%
  group_by(assigned_room_type) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
assigned_room_type_cancellations.df <- full_join(assigned_room_type_bookings, assigned_room_type_cancellations, by = "assigned_room_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

assigned_room_type_cancellations.chart <- ggplot(assigned_room_type_cancellations.df, aes( y=cancel_rate, reorder(x=assigned_room_type, -cancel_rate), fill=assigned_room_type)) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
    labs(y = "Cancel Rate", x = "Assigned Room Type", title = "Cancel Rate by Assigned Room Type")


assigned_room_type_cancellations.df;
## # A tibble: 12 x 4
##    assigned_room_type bookings cancellations cancel_rate
##    <fct>                 <int>         <int>       <dbl>
##  1 L                         1             1      1     
##  2 P                        12            12      1     
##  3 A                     74053         32948      0.445 
##  4 H                       712           251      0.353 
##  5 G                      2553           780      0.306 
##  6 E                      7806          1968      0.252 
##  7 D                     25322          6362      0.251 
##  8 F                      3751           927      0.247 
##  9 B                      2159           508      0.235 
## 10 C                      2375           446      0.188 
## 11 K                       279            12      0.0430
## 12 I                       363             5      0.0138
assigned_room_type_cancellations.chart;

#Add comparison to see if assigned room is the same as the reserved room
#Comparison to see if assigned room is the same as the reserved room
reserved_room_change.df <- data.df.one %>%
  select(is_canceled, reserved_room_type, assigned_room_type) %>%
  mutate(room_changed = ifelse(str_detect(as.character(assigned_room_type), as.character(reserved_room_type)) == "TRUE",0,1)) %>%
  mutate(changed_room_type = as.factor(paste(reserved_room_type,assigned_room_type,sep="_to_")))

changed_room_type_cancellations <- reserved_room_change.df %>%
  select(is_canceled, room_changed, changed_room_type) %>%
  filter(is_canceled == 1, room_changed == 1) %>%
  group_by(changed_room_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

changed_room_type_bookings <- reserved_room_change.df %>%
  select(is_canceled, room_changed, changed_room_type) %>%
  group_by(changed_room_type) %>%
  summarize(bookings = n(), .groups = 'drop')


# Join monthly data on month
changed_room_type.df <- left_join(changed_room_type_cancellations, changed_room_type_bookings, by = "changed_room_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

# Include only countries in 75th percentile of total bookings 
quantile(changed_room_type.df$bookings,.75)
##    75% 
## 287.75
changed_room_type_75.df <- changed_room_type.df %>%
  filter(bookings >= quantile(changed_room_type.df$bookings,.75))
  
changed_room_type_75.df;
## # A tibble: 8 x 4
##   changed_room_type cancellations bookings cancel_rate
##   <fct>                     <int>    <int>       <dbl>
## 1 A_to_B                      146     1123      0.130 
## 2 A_to_C                      139     1447      0.0961
## 3 A_to_F                       27      417      0.0647
## 4 D_to_E                       35      686      0.0510
## 5 E_to_F                       19      404      0.0470
## 6 A_to_D                      319     7548      0.0423
## 7 A_to_E                       47     1156      0.0407
## 8 D_to_A                       10      312      0.0321
changed_room_type_75.chart <- ggplot(changed_room_type_75.df, aes( y=cancel_rate, reorder(x=changed_room_type, -cancel_rate), fill=changed_room_type)) + 
    geom_bar( stat="identity") +
  scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
    labs(y = "Cancel Rate", x = "Changed Room Type", title = "Cancel Rate by Changed Room Type - 75th Percentile")

changed_room_type_75.chart

# check cancel rate when room is not changed. Used to support assigned room cancellations to determine whether assigned room was the room that was reserved when they canceled.

unchanged_room_type_cancellations <- reserved_room_change.df %>%
  select(is_canceled, room_changed, changed_room_type) %>%
  filter(is_canceled == 1, room_changed == 0) %>%
  group_by(changed_room_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

unchanged_room_type_bookings <- reserved_room_change.df %>%
  select(is_canceled, room_changed, changed_room_type) %>%
  group_by(changed_room_type) %>%
  summarize(bookings = n(), .groups = 'drop')

unchanged_room_type.df <- left_join(unchanged_room_type_cancellations, unchanged_room_type_bookings, by = "changed_room_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

quantile(changed_room_type.df$bookings,.75)
##    75% 
## 287.75
unchanged_room_type_75.df <- unchanged_room_type.df %>%
  filter(bookings >= quantile(unchanged_room_type.df$bookings,.75))


unchanged_room_type_75.chart <- ggplot(unchanged_room_type.df, aes( y=cancel_rate, reorder(x=changed_room_type, -cancel_rate), fill=changed_room_type)) + 
    geom_bar( stat="identity") +
  scale_fill_manual(values = brewer.pal(n=12,name="Set3")) +
    labs(y = "Cancel Rate", x = "Unchanged Room Type", title = "Cancel Rate by Unchanged Room Type - 75th Percentile")

  

unchanged_room_type_75.chart

This shows us that if a customer is moved from an A type room to another type of room, they are more likely to cancel.

# lead_time histogram and analysis
lead_time <- data.df.one %>%
  select(lead_time, is_canceled) %>%
  mutate(lead_time_bins = ifelse(lead_time <= 30, "1 Month",
                          ifelse(lead_time <= 60, "2 Months",
                          ifelse(lead_time <= 90, "3 Months",
                          ifelse(lead_time <= 180, "4 - 6 Months",
                          ifelse(lead_time <= 365, "6 - 12 Months",
                          "More than a Year"))))))

lead_time_cancellations <- lead_time %>%
  select(is_canceled, lead_time_bins) %>%
  filter(is_canceled == 1) %>%
  group_by(lead_time_bins) %>%
  summarize(cancellations = n(), .groups = 'drop')

lead_time_bookings <- lead_time %>%
  select(is_canceled, lead_time_bins) %>%
  group_by(lead_time_bins) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
lead_time.df <- full_join(lead_time_bookings, lead_time_cancellations, by = "lead_time_bins") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

lead_time.chart <- ggplot(lead_time.df, aes( y=cancel_rate, reorder(x=lead_time_bins, -cancel_rate), fill=lead_time_bins)) + 
    geom_bar( stat="identity") +
  scale_fill_manual(values = brewer.pal(n=6, name="Set3")) +
    labs(y = "Cancel Rate", x = "Lead Time", title = "Cancel Rate by Lead Time")



lead_time.df;
## # A tibble: 6 x 4
##   lead_time_bins   bookings cancellations cancel_rate
##   <chr>               <int>         <int>       <dbl>
## 1 More than a Year     3148          2130       0.677
## 2 6 - 12 Months       21544         11947       0.555
## 3 4 - 6 Months        26439         11821       0.447
## 4 3 Months            12583          4967       0.395
## 5 2 Months            16970          6174       0.364
## 6 1 Month             38702          7181       0.186
lead_time.chart;

The farther in advance a booking is made, the more likely the booking is to be cancelled. Bookings made over 4 months ahead of time have a cancellation rate much higher than average. Bookings made 1 month in advance have a significantly lower than average (18.6%) cancellation rate.

# arrival_data_day_of_month_histogram
arrival_date_day_of_month_cancellations <- data.df.one %>%
  select(arrival_date_day_of_month, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(arrival_date_day_of_month) %>%
  summarize(cancellations = n(), .groups = 'drop')
arrival_date_day_of_month_bookings <- data.df.one %>%
  select(arrival_date_day_of_month, is_canceled) %>%
  group_by(arrival_date_day_of_month) %>%
  summarize(bookings = n(), .groups = 'drop')
arrival_date_day_of_month.df <- full_join(arrival_date_day_of_month_bookings, arrival_date_day_of_month_cancellations, by = "arrival_date_day_of_month") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

colourCount = length(unique(arrival_date_day_of_month.df$arrival_date_day_of_month))
getPalette = colorRampPalette(brewer.pal(n=31, name="Set3"))
arrival_date_day_of_month.chart <- ggplot(arrival_date_day_of_month.df, aes( y=cancel_rate, reorder(x=arrival_date_day_of_month, arrival_date_day_of_month), fill=as.factor(arrival_date_day_of_month))) +
    geom_bar( stat="identity") +
    labs(y = "Cancel Rate", x = "Date of Month", title = "Cancel Rate by Arrival Date of Month") 
    
arrival_date_day_of_month.chart;

arrival_date_day_of_month.df;
## # A tibble: 31 x 4
##    arrival_date_day_of_month bookings cancellations cancel_rate
##                        <int>    <int>         <int>       <dbl>
##  1                         8     3921          1647       0.420
##  2                        17     4406          1837       0.417
##  3                         3     3854          1544       0.401
##  4                         1     3626          1444       0.398
##  5                        15     4196          1670       0.398
##  6                         7     3665          1441       0.393
##  7                        26     4147          1626       0.392
##  8                         5     4315          1677       0.389
##  9                        28     3946          1511       0.383
## 10                        14     3819          1461       0.383
## # ... with 21 more rows

There does not appear to be a relationship between the arrival date of month and the cancellation rate.

#Total length of stay histogram
length_of_stay <- data.df.one %>%
  select(stays_in_week_nights,stays_in_weekend_nights,is_canceled) %>%
  mutate(total_days = stays_in_week_nights+stays_in_weekend_nights)
length_of_stay_cancellations <- length_of_stay %>%
  select(total_days, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(total_days) %>%
  summarize(cancellations = n(), .groups = 'drop')
length_of_stay_bookings <- length_of_stay %>%
  select(total_days, is_canceled) %>%
  group_by(total_days) %>%
  summarize(bookings = n(), .groups = 'drop')
length_of_stay.df <- full_join(length_of_stay_bookings, length_of_stay_cancellations, by = "total_days") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))
# only chart where bookings are 50th percentile or higher. 50th percentile is about a month, anything longer, we did not consider a typical vacation.
quantile(length_of_stay.df$bookings,.50)
## 50% 
##  20
colourCount = length(unique(length_of_stay.df$total_days))
 getPalette = colorRampPalette(brewer.pal(n=23, name="Set3"))
length_of_stay.chart <-length_of_stay.df %>%
  filter(bookings >= quantile(length_of_stay.df$bookings,.50),) %>%
  ggplot(aes( y=cancel_rate, reorder(x=total_days, total_days), fill=as.factor(total_days))) +
    geom_bar( stat="identity") +
    labs(y = "Cancel Rate", x = "Length of Stay", title = "Cancel Rate by Length of Stay")
  
length_of_stay.chart;

tibble(length_of_stay.df);
## # A tibble: 45 x 4
##    total_days bookings cancellations cancel_rate
##         <int>    <int>         <int>       <dbl>
##  1         26        6             6       1    
##  2         33        3             3       1    
##  3         20       14            13       0.929
##  4         30       13            11       0.846
##  5         24        6             5       0.833
##  6         18       35            29       0.829
##  7         27        5             4       0.8  
##  8         29       14            11       0.786
##  9         23        8             6       0.75 
## 10         19       22            16       0.727
## # ... with 35 more rows

There is an upward trend indicating guests are more likely to cancel when they have longer stays.

# Repeated guest histogram
is_repeated_guest_cancellations <- data.df.one %>%
  select(is_repeated_guest, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(is_repeated_guest) %>%
  summarize(cancellations = n(), .groups = 'drop')
is_repeated_guest_bookings <- data.df.one %>%
  select(is_repeated_guest, is_canceled) %>%
  group_by(is_repeated_guest) %>%
  summarize(bookings = n(), .groups = 'drop')
is_repeated_guest.df <- full_join(is_repeated_guest_bookings, is_repeated_guest_cancellations, by = "is_repeated_guest") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

is_repeated_guest.chart <- ggplot(is_repeated_guest.df, aes( y=cancel_rate, reorder(x=is_repeated_guest, -cancel_rate), fill=as.factor(is_repeated_guest))) +
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Repeated Guest (0 = no, 1 = yes)", title = "Cancel Rate by Repeated Guests")

  
is_repeated_guest.chart;

tibble(is_repeated_guest.df);
## # A tibble: 2 x 4
##   is_repeated_guest bookings cancellations cancel_rate
##               <int>    <int>         <int>       <dbl>
## 1                 0   115576         43668       0.378
## 2                 1     3810           552       0.145

Repeated guests have a cancellation rate of 14.5%, whereas first time guests have a cancellation rate of 37.8%. Repeated guests’ cancellation rate of 14.5% is significantly less than the average cancellation rate of 37.04% indicating repeated guests are more dependable.

# previous cancellations histogram
is_previous_cancellations <- data.df.one %>%
  mutate(previous_cancellations = ifelse(previous_cancellations >= 1,1,0))
previous_cancellations_cancellations <- is_previous_cancellations %>%
  select(previous_cancellations, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(previous_cancellations) %>%
  summarize(cancellations = n(), .groups = 'drop')
previous_cancellations_bookings <- is_previous_cancellations %>%
  select(previous_cancellations, is_canceled) %>%
  group_by(previous_cancellations) %>%
  summarize(bookings = n(), .groups = 'drop')
previous_cancellations.df <- full_join(previous_cancellations_bookings, previous_cancellations_cancellations, by = "previous_cancellations") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))
previous_cancellations.chart <- ggplot(previous_cancellations.df, aes( y=cancel_rate, reorder(x=previous_cancellations, -cancel_rate), fill=as.factor(previous_cancellations))) +
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Previous Cancellations (0 = no, 1 = yes)", title = "Cancel Rate by Previous Cancellations")
  
previous_cancellations.chart;

tibble(previous_cancellations.df)
## # A tibble: 2 x 4
##   previous_cancellations bookings cancellations cancel_rate
##                    <dbl>    <int>         <int>       <dbl>
## 1                      1     6484          5942       0.916
## 2                      0   112902         38278       0.339

Guests with previous cancellations have an extremely high cancellation rate of 91.6%. Guests who have previously cancelled should assigned a much higher level of cancellation risk than those who have not previously cancelled.

#previous bookings not canceled histogram
is_previous_bookings_not_canceled <- data.df.one %>%
  mutate(previous_bookings_not_canceled = ifelse(previous_bookings_not_canceled >= 1,1,0))
previous_bookings_not_canceled_cancellations <- is_previous_bookings_not_canceled %>%
  select(previous_bookings_not_canceled, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(previous_bookings_not_canceled) %>%
  summarize(cancellations = n(), .groups = 'drop')
previous_bookings_not_canceled_bookings <- is_previous_bookings_not_canceled %>%
  select(previous_bookings_not_canceled, is_canceled) %>%
  group_by(previous_bookings_not_canceled) %>%
  summarize(bookings = n(), .groups = 'drop')
previous_bookings_not_canceled.df <- full_join(previous_bookings_not_canceled_cancellations, previous_bookings_not_canceled_bookings, by = "previous_bookings_not_canceled") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))
previous_bookings_not_canceled.chart <- ggplot(previous_bookings_not_canceled.df, aes( y=cancel_rate, reorder(x=previous_bookings_not_canceled, -cancel_rate), fill=as.factor(previous_bookings_not_canceled))) +
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Previous Bookings not Cancelled (0 = no, 1 = yes)", title = "Cancel Rate by Previous Bookings Not Cancelled")

previous_bookings_not_canceled.chart;

tibble(previous_bookings_not_canceled.df);
## # A tibble: 2 x 4
##   previous_bookings_not_canceled cancellations bookings cancel_rate
##                            <dbl>         <int>    <int>       <dbl>
## 1                              0         44020   115766      0.380 
## 2                              1           200     3620      0.0552

Guests are less likely to cancel if they have previously bookings that were not cancelled. Observing guests’ cancellation history can be helpful in predicting the odds they will cancel.

#Whether customer waited on list for reservation? histogram
is_days_in_waiting_list <- data.df.one %>%
  mutate(days_in_waiting_list = ifelse(days_in_waiting_list >= 1,1,0))

days_in_waiting_list_cancellations <- is_days_in_waiting_list %>%
  select(days_in_waiting_list, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(days_in_waiting_list) %>%
  summarize(cancellations = n(), .groups = 'drop')

days_in_waiting_list_bookings <- is_days_in_waiting_list %>%
  select(days_in_waiting_list, is_canceled) %>%
  group_by(days_in_waiting_list) %>%
  summarize(bookings = n(), .groups = 'drop')

days_in_waiting_list.df <- full_join(days_in_waiting_list_cancellations, days_in_waiting_list_bookings, by = "days_in_waiting_list") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

days_in_waiting_list.chart <- ggplot(days_in_waiting_list.df, aes( y=cancel_rate, reorder(x=days_in_waiting_list, -cancel_rate), fill=as.factor(days_in_waiting_list))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "On Waiting List (0 = no, 1 = yes)", title = "Cancel Rate by Being on Waiting List")

tibble(days_in_waiting_list.df);
## # A tibble: 2 x 4
##   days_in_waiting_list cancellations bookings cancel_rate
##                  <dbl>         <int>    <int>       <dbl>
## 1                    1          2359     3698       0.638
## 2                    0         41861   115688       0.362
days_in_waiting_list.chart

Approximately 63.8% of bookings are cancelled when customers are placed on a waiting list, hence customers are almost twice as likely to cancel when they are on a waiting list. Customers who are placed on a waiting list most likely search for other available options, and cancel their booking when they find a booking elsewhere that does not require them to be placed on a waiting list.

#Whether customer has special requests histogram
is_total_of_special_requests <- data.df.one %>%
  mutate(total_of_special_requests = ifelse(total_of_special_requests >= 1,1,0))

total_of_special_requests_cancellations <- is_total_of_special_requests %>%
  select(total_of_special_requests, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(total_of_special_requests) %>%
  summarize(cancellations = n(), .groups = 'drop')

total_of_special_requests_bookings <- is_total_of_special_requests %>%
  select(total_of_special_requests, is_canceled) %>%
  group_by(total_of_special_requests) %>%
  summarize(bookings = n(), .groups = 'drop')

total_of_special_requests.df <- full_join(total_of_special_requests_cancellations, total_of_special_requests_bookings, by = "total_of_special_requests") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

total_of_special_requests.chart <- ggplot(total_of_special_requests.df, aes( y=cancel_rate, reorder(x=total_of_special_requests, -cancel_rate), fill=as.factor(total_of_special_requests))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Special Request (0 = no, 1 = yes)", title = "Cancel Rate by Special Requests")

total_of_special_requests.df;
## # A tibble: 2 x 4
##   total_of_special_requests cancellations bookings cancel_rate
##                       <dbl>         <int>    <int>       <dbl>
## 1                         0         33556    70318       0.477
## 2                         1         10664    49068       0.217
total_of_special_requests.chart;

Customers with special requests have a cancellation rate of 21.7% indicating they are less likely to cancel than those without special requests. It is more difficult for customers with special requests to search for alternative options because other hotels may not be able to cater to their special requests.

# total people histogram
number_of_people <- data.df.one %>%
  select(children,babies,adults,is_canceled) %>%
  mutate(total_people = children+babies+adults) 

number_of_people_cancellations <- number_of_people %>%
  select(total_people, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(total_people) %>%
  summarize(cancellations = n(), .groups = 'drop')

number_of_people_bookings <- number_of_people %>%
  select(total_people, is_canceled) %>%
  group_by(total_people) %>%
  summarize(bookings = n(), .groups = 'drop')

number_of_people.df <- full_join(number_of_people_bookings, number_of_people_cancellations, by = "total_people") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

# only chart where bookings are 50th percentile or higher. 50th percentile is about a month, anything longer, we did not consider a typical vacation.
quantile(number_of_people.df$bookings,.50)
## 50% 
##   2
number_of_people.chart <-number_of_people.df %>%
  filter(bookings >= quantile(number_of_people.df$bookings,.50),) %>%
  ggplot(aes( y=cancel_rate, reorder(x=total_people, total_people), fill=as.factor(total_people))) + 
    geom_bar( stat="identity") +
   scale_fill_manual(values = brewer.pal(n=11, name="Set3")) +
    labs(y = "Cancel Rate", x = "Total Number of People", title = "Cancel Rate by Total Number of People")

  
  
number_of_people.chart;

number_of_people.df
## # A tibble: 15 x 4
##    total_people bookings cancellations cancel_rate
##           <int>    <int>         <int>       <dbl>
##  1            6        1             1       1    
##  2           20        2             2       1    
##  3           26        5             5       1    
##  4           27        2             2       1    
##  5           40        1             1       1    
##  6           50        1             1       1    
##  7           55        1             1       1    
##  8           10        2             1       0.5  
##  9           12        2             1       0.5  
## 10            4     3929          1646       0.419
## 11            2    82048         32569       0.397
## 12            3    10494          3376       0.322
## 13            1    22581          6555       0.290
## 14            5      137            34       0.248
## 15            0      180            25       0.139

A positive relationship is observed between the number of people in a booking and the cancellation rate. The more people in a reservation, the more likely the reservation is to be cancelled. The extremely small number of observations for bookings with more than 4 people challenges the signifance of this relationship.

#Have Children Histogram
is_children <- data.df.one %>%
  mutate(children_babies = ifelse(babies >= 1,1,
                    ifelse(children >= 1,1,0)))

is_children_cancellations <- is_children %>%
  select(children_babies, is_canceled) %>%
  filter(is_canceled == 1) %>%
  group_by(children_babies) %>%
  summarize(cancellations = n(), .groups = 'drop')

is_children_bookings <- is_children %>%
  select(children_babies, is_canceled) %>%
  group_by(children_babies) %>%
  summarize(bookings = n(), .groups = 'drop')

is_children.df <- full_join(is_children_cancellations, is_children_bookings, by = "children_babies") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

is_children.chart <- ggplot(is_children.df, aes( y=cancel_rate, reorder(x=children_babies, -cancel_rate), fill=as.factor(children_babies))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Children (0 = no, 1 = yes)", title = "Cancel Rate by Children")


is_children.df;
## # A tibble: 2 x 4
##   children_babies cancellations bookings cancel_rate
##             <dbl>         <int>    <int>       <dbl>
## 1               0         40961   110054       0.372
## 2               1          3259     9332       0.349
is_children.chart;

Whether or not a guest has children does not appear to affect the cancellation rate. This finding is surprising as we expected guests with children to be more likely to cancel due to the challenges of traveling with children.

adr <- data.df.one %>%
  select(adr, is_canceled) %>%
  mutate(adr_bins = ifelse(adr <= 30, "under 30",
                          ifelse(adr <= 60, "30 - 90",
                          ifelse(adr <= 90, "60 - 90",
                          ifelse(adr <= 120, "90 - 120",
                          ifelse(adr <= 150, "120 - 150",
                          ifelse(adr <= 180,"150 - 180",
                          ifelse(adr <= 210,"180 - 210",
                          ifelse(adr <= 240,"210 - 240",
                          "240 and higher")))))))))

adr_cancellations <- adr %>%
  select(is_canceled, adr_bins) %>%
  filter(is_canceled == 1) %>%
  group_by(adr_bins) %>%
  summarize(cancellations = n(), .groups = 'drop')

adr_bookings <- adr %>%
  select(is_canceled, adr_bins) %>%
  group_by(adr_bins) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data
adr.df <- full_join(adr_bookings, adr_cancellations, by = "adr_bins") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

adr.chart <- ggplot(adr.df, aes( y=cancel_rate, reorder(x=adr_bins, -cancel_rate), fill=as.factor(adr_bins))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=9, name="Set3")) +
    labs(y = "Cancel Rate", x = "Average Daily Rate", title = "Cancel Rate by Average Daily Rate")


adr.df;
## # A tibble: 9 x 4
##   adr_bins       bookings cancellations cancel_rate
##   <chr>             <int>         <int>       <dbl>
## 1 60 - 90           38077         15565       0.409
## 2 210 - 240          2189           863       0.394
## 3 120 - 150         16703          6430       0.385
## 4 90 - 120          30112         11587       0.385
## 5 150 - 180          8622          3302       0.383
## 6 240 and higher     1702           643       0.378
## 7 180 - 210          3995          1479       0.370
## 8 30 - 90           14459          3767       0.261
## 9 under 30           3527           584       0.166
adr.chart

The cheaper the hotel the less likely the customer is to cancel. The cancellation rate for bookings with an ADR of $30-$90 is 26%, and the cancellation rate for bookings with an ADR under $30 is 16.6%. Both of these percentages are significantly lower than the average cancellation rate of 37.04%. This is logical because lower income families are more likely to value vacation more as, it does not happen often.

levels(data.df.one$deposit_type)
## [1] "No Deposit" "Non Refund" "Refundable"
deposit_type_cancellations <- data.df.one %>%
  select(is_canceled, deposit_type) %>%
  filter(is_canceled == 1) %>%
  group_by(deposit_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

deposit_type_bookings <- data.df.one %>%
  select(is_canceled, deposit_type) %>%
  group_by(deposit_type) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
deposit_type.df <- full_join(deposit_type_bookings, deposit_type_cancellations, by = "deposit_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

deposit_type.chart <- ggplot(deposit_type.df, aes( y=cancel_rate, reorder(x=deposit_type, -cancel_rate), fill=as.factor(deposit_type))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=3, name="Set3")) +
    labs(y = "Cancel Rate", x = "Deposit Type", title = "Cancel Rate by Deposit Type")



deposit_type.df;
## # A tibble: 3 x 4
##   deposit_type bookings cancellations cancel_rate
##   <fct>           <int>         <int>       <dbl>
## 1 Non Refund      14587         14494       0.994
## 2 No Deposit     104637         29690       0.284
## 3 Refundable        162            36       0.222
deposit_type.chart;

Customers are much more likely to cancel if there is a non-refunadable deposit. This effect is most likely due to reverse causality as hotels with extremely high cancellation rates are more likely to have non-refundable deposits to protect against losses.

#whether customer used agent or company
agent_company <- data.df.one %>%
  select(agent,company, is_canceled) %>%
  mutate(agent = ifelse(agent=="none",0,agent)) %>%
  mutate(company = ifelse(company=="none",0,company)) %>%
  mutate(agent_company_cancat = (ifelse(agent > 0 | company > 0,1,0)))

agent_company_cancellations <- agent_company %>%
  select(is_canceled, agent_company_cancat) %>%
  filter(is_canceled == 1) %>%
  group_by(agent_company_cancat) %>%
  summarize(cancellations = n(), .groups = 'drop')

agent_company_bookings <- agent_company %>%
  select(is_canceled, agent_company_cancat) %>%
  group_by(agent_company_cancat) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data
agent_company.df <- full_join(agent_company_bookings, agent_company_cancellations, by = "agent_company_cancat") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

agent_company.chart <- ggplot(agent_company.df, aes( y=cancel_rate, reorder(x=agent_company_cancat, -cancel_rate), fill=as.factor(agent_company_cancat))) + 
    geom_bar( stat="identity") +
    scale_fill_manual(values = brewer.pal(n=2, name="Set3")) +
    labs(y = "Cancel Rate", x = "Booked Through Agent/Company (1 = yes, 0 = no)", title = "Cancel Rate by How Booking was Made")

agent_company.df;
## # A tibble: 2 x 4
##   agent_company_cancat bookings cancellations cancel_rate
##                  <dbl>    <int>         <int>       <dbl>
## 1                    1   109628         41366       0.377
## 2                    0     9758          2854       0.292
agent_company.chart

Customers who make bookings through travel agents or companies have a cancellation rate of 29.2% which is about 8% less than the average cancellation rate. Hence, customers may be less likely to cancel when they make bookings through agents or companies.

# customer_type histogram and analysis
levels(data.df.one$customer_type)
## [1] "Contract"        "Group"           "Transient"       "Transient-Party"
customer_type_cancellations <- data.df.one %>%
  select(is_canceled, customer_type) %>%
  filter(is_canceled == 1) %>%
  group_by(customer_type) %>%
  summarize(cancellations = n(), .groups = 'drop')

customer_type_bookings <- data.df.one %>%
  select(is_canceled, customer_type) %>%
  group_by(customer_type) %>%
  summarize(bookings = n(), .groups = 'drop')

# Join monthly data on month
customer_type.df <- full_join(customer_type_bookings, customer_type_cancellations, by = "customer_type") %>%
  mutate(cancel_rate = cancellations/bookings) %>%
  arrange(desc(cancel_rate))

customer_type.chart <- ggplot(customer_type.df, aes( y=cancel_rate, reorder(x=customer_type, -cancel_rate), fill=as.factor(customer_type))) + 
    geom_bar( stat="identity") +
   scale_fill_manual(values = brewer.pal(n=4, name="Set3")) +
    labs(y = "Cancel Rate", x = "Customer Type", title = "Cancel Rate by Customer Type")
 
  
customer_type.df;
## # A tibble: 4 x 4
##   customer_type   bookings cancellations cancel_rate
##   <fct>              <int>         <int>       <dbl>
## 1 Transient          89613         36514       0.407
## 2 Contract            4076          1262       0.310
## 3 Transient-Party    25120          6385       0.254
## 4 Group                577            59       0.102
customer_type.chart

Cancellation rate varies depending on the customer type. Contract, group, and transient-parties are less likely to cancel than average customers, whereas transient customers are more likely to cancel. The majority of bookings are made by transient customers which explains why it has such a large impact on the mean cancellation rate.

Summary

Cancellations are an imminent threat in the travel and tourism industry. To survive in the competitive industry, hotels must utilize cancellation policies and tactful overbooking strategies. A major hotel chain enlisted our help in providing advice on creating the most efficient and profitable overbooking strategy in order to remain competitive in the industry. To determine the best overbooking strategy we analyzed comprehensive hotel booking data from 2015 - 2017. Our strategy involved exploring relationships between individual variables and the cancellation rate to determine which factors have the largest influence on booking cancellations.

The analysis provided numerous major insights. The average cancellation rate of 37% was used as a baseline for comparing results. A significant difference was found between the cancellation rate of city hotels and resort hotels. Approximately 42% of bookings at city hotels were cancelled, whereas only 28% of bookings at resort hotels were cancelled. Additionally, we found cancellations are highest from April - June. A negative relationship was observed between lead time and the cancellation rate. The farther in advance a booking is made, the more likely it is to be cancelled. Customer history and characteristics can also be used to determine cancellation risk. The cancellation rate for repeat guests is approximately 23% lower than the average cancellation rate, and guests with previous cancellations have an extremely high cancellation rate of 91.6%. Additionally, being placed on the waiting list also greatly increases the likelihood of a booking being cancelled, and customers with special requests are less likely to cancel their bookings. A negative relationship was found between the average daily rate and the cancellation rate. Bookings with very low (under $90) average daily rates had much lower cancellation rates than average. Finally, customers are also 8% less likely to cancel their bookings when they book through a travel agency/company.

These findings provide insight on which variables our client should consider in crafting their overbooking strategy. More aggressive overbooking strategies should be utilized by city hotels, during the springtime, for bookings with long lead times, for guests with previous cancelations, and for guests placed on waiting lists. Less aggressive overbooking strategies should be utilized by resort hotels, for repeat guests, for guests with special requests, for bookings with extremely low ADR’s, and for bookings made through a travel agency/company. These factors are unlikely to align with each other, hence hotels must assess and weigh the importance of each of these factors in determining cancellation risk. Overall, assessing hotel type, customer history and characteristics, and booking characteristics is a strategic tool hotels can use to create optimal overbooking strategies to remain competitive in the industry.
A major limitation of our analysis is our examination of correlation rather than causation. We examined the impact of various variables on the cancellation rate to see if there was a relationship between the two variables, however we did not assess whether specific variables truly caused an increase or decrease in the cancellation rate. Some variables might decrease or increase as a result of the cancellation rate changing, indicating reverse causality. The relationship between nonrefundable deposits and the cancellation rate are one example of this. Higher cancellation rates are more likely to result in the use of nonrefundable deposits rather than nonrefundable deposits leading to higher cancellation rates. Another limitation of our analysis was our inability to measure multicollinearity between variables. We analyzed the impact of numerous independent variables on the cancellation rate, however because we did not fit any models to our data or perform a residual analysis it is impossible to determine whether these variables are truly independent of one another. Many of these variables may cause other independent variables to change, further challenging the significance of our results. Future research should examine causation between various variables and the cancellation rate, and fit models to the data and perform tests for multicollinearity.