Project: Hotel Booking Demand Prediction

Prepared by: Group 9

Name Matric Number
Lee Sok Qi 17098730
Tang Wei Lai  17143457
Siti Rawiah Binti Mohamad Tarmithi S2176711
Ili Fadilah Binti Alias S2156787
Ainul Balqis Binti Sawal 17196450

Introduction

Hotel sector is in the constantly shifting. In order to enhance pricing, marketing and other business strategies, hotel managers must comprehend the element that affect booking demand. In this analysis, we will explore the Hotel Booking Demand dataset from Kaggle, which contains booking information for a resort hotel and a city hotel and includes information such as when the booking was made, length of stay, the number of adults, and the amount paid. Hotel sector is extremely unpredictable, and reservations depends on a wide range of factors including but not limited to hotel type and seasonality. To better assist the hotel management in making better strategies, it is more crucial to analyse the patterns found in the historical data. The historical data could be utilized by the hotel management to run the number of marketing initiatives designed to boost the businesses.

Our goal is to identify trends and patterns in the data that can provide actionable insights for the hotel management and to predict the likelihood of bookings being cancelled in order to mitigate the revenue loss derived from the booking cancellations

Questions

Question 1: Which type of the hotel have a higher number of bookings by adults with children

Question 2: What is the month with the greatest number of guest arrival?

Question 3: What is guest preferred meal plan?

Question 4: Which attributes in the dataset correlates with one another and contributes significantly for the hotel cancellation predictive modelling?

Objectives

To study the correlation between hotel booking demand versus the variables from the dataset based on the data analysis. The result will be useful in developing a model to predict the likelihood of booking cancellations that is expected to provide earlier information associated with the hotel booking demand.

Details Dataset

The dataset contains information on bookings for two hotels in Portugal (a resort and a city hotel) scheduled to arrive in a period between July 1, 2015, and August 31, 2017.

A total of 119,390 observations and 32 variables can be found in this dataset including those that were cancelled. For both hotels, the same information was collected describing 40,060 observations for the resort and 79,330 observations for the city hotel.

The dataset can be accessed from Kaggle: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand?resource=download

Data Descriptions

No. Variables Description
1 hotel Type of Hotel whether Resort Hotel or City Hotel
2 is_canceled Value indicating if the booking was canceled (1) or not (0)
3 lead_time Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
4 arrival_date_year Year of arrival date
5 arrival_date_month Month of arrival date
6 arrival_date_week_number week number of year for arrival date
7 arrival_date_day_of_month Day of arrival date
8 stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
9 stays_in_week_nights Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel
10 adults Number of adults
11 children Number of children
12 babies Number of babies
13 meal Type of meal booked. Undefined/SC – no meal package, BB – Bed & Breakfast, HB – Half board (breakfast and one other meal – usually dinner), FB – Full board (breakfast, lunch and dinner)
14 country Country of origin
15 market_segment Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
16 distribution_channel Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
17 is_repeated_guest Value indicating if the booking name was from a repeated guest (1) or not (0)
18 previous_cancellations Number of previous bookings that were cancelled by the customer prior to the current booking
19 previous_bookings_not_canceled Number of previous bookings not cancelled by the customer prior to the current booking
20 reserved_room_type Code of room type reserved. Code is presented instead of designation for anonymity reasons
21 assigned_room_type Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request.
22 booking_changes Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
23 deposit_type Type of deposit made for booking: No Deposit – no deposit was made,Non Refund – a deposit was made in the value of the total stay cost, Refundable – a deposit was made with a value under the total cost of stay
24 agent ID of the travel agency that made the booking
25 company ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
26 days_in_waiting_list Number of days the booking was in the waiting list before it was confirmed to the customer
27 customer_type Type of booking: Contract - when the booking has an allotment or other type of contract associated to it, Group – when the booking is associated to a group, Transient – when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party – when the booking is transient, but is associated to at least other transient booking
28 adr Average Daily Rate (ADR) as defined by dividing the sum of all lodging transactions by the total number of staying nights
29 required_car_parking_spaces Number of car parking spaces required by the customer
30 total_of_special_requests Number of special requests made by the customer (e.g., twin bed or high floor)
31 reservation_status Reservation last status:Canceled – booking was canceled by the customer,Check-Out – customer has checked in but already departed, No-Show – customer did not check-in and did inform the hotel of the reason why
32 reservation_status_date Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when the booking was canceled or when did the customer checked-out of the hotel

Data Preparation

The third fundamental step of CRISP DM is Data Preparation. In this section, the processes include identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity.

Load the data from a .csv file
The first step is to import all relevant libraries and import the dataset.

library(dplyr)
library(stringr)
library(tidyverse)
library(lubridate)
library(caret)
library(randomForest)
library(datasets)
library(caTools)
library(party)
library(magrittr)
library(forcats)

raw_data = read.csv("hotel_bookings.csv")

The directory of the file depends on the dataset’s location saved and it would be different and need to amend accordingly.

Explore the raw data

head(raw_data)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
print(dim(raw_data))
## [1] 119390     32

From the above code, we observed that the dataset has 119390 rows and 32 columns.

Understand Data

print(summary(raw_data))
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 

The mean, median, first quartile, third quartile, minimum and maximum value of each columns are shown above.

str(raw_data)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Data Conversion

After getting the data type, and data entry samples of each columns we change NULL values to NA. This is because in R programming NULL represents that the value in question simply doesn’t exist, rather than being existent but unknown.

raw_data[raw_data == "NULL"] <- NA

str(raw_data[,c("agent","company")])
## 'data.frame':    119390 obs. of  2 variables:
##  $ agent  : chr  NA NA NA "304" ...
##  $ company: chr  NA NA NA NA ...

We also would like to change the data type of reservation_status_date to date format.

raw_data$reservation_status_date <- as.Date(raw_data$reservation_status_date)
class(raw_data$reservation_status_date)
## [1] "Date"

Missing Value

We will check if there is any missing values in the dataset

missing_column <-colSums(is.na(raw_data))
print(missing_column)
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                            488 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                          16340 
##                        company           days_in_waiting_list 
##                         112593                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

There are 4 missing values in children, 488 in country, 16340 in agent, and 112593 in company. Next we find the percentage of missing values in each column.

Find the percentage of missing values in each column

percent_na <- round((missing_column/nrow(raw_data))*100,2)
print(percent_na)
##                          hotel                    is_canceled 
##                           0.00                           0.00 
##                      lead_time              arrival_date_year 
##                           0.00                           0.00 
##             arrival_date_month       arrival_date_week_number 
##                           0.00                           0.00 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                           0.00                           0.00 
##           stays_in_week_nights                         adults 
##                           0.00                           0.00 
##                       children                         babies 
##                           0.00                           0.00 
##                           meal                        country 
##                           0.00                           0.41 
##                 market_segment           distribution_channel 
##                           0.00                           0.00 
##              is_repeated_guest         previous_cancellations 
##                           0.00                           0.00 
## previous_bookings_not_canceled             reserved_room_type 
##                           0.00                           0.00 
##             assigned_room_type                booking_changes 
##                           0.00                           0.00 
##                   deposit_type                          agent 
##                           0.00                          13.69 
##                        company           days_in_waiting_list 
##                          94.31                           0.00 
##                  customer_type                            adr 
##                           0.00                           0.00 
##    required_car_parking_spaces      total_of_special_requests 
##                           0.00                           0.00 
##             reservation_status        reservation_status_date 
##                           0.00                           0.00

From the above code chunk we obtained the percentage of missing values in each attributes. It is apparent that company has the highest percentage of 94.31% missing values, followed by agent with 13.69% missing values. Hence, we decided to drop both of these attributes.

drop <- c("company","agent")
raw_data <- raw_data[,!names(raw_data) %in% drop]
print(dim(raw_data))
## [1] 119390     30

After removing the columns, there are only 30 attributes left in the dataset.

There are still missing values in children, and country attribute. However from the previous result, we know that there is only 0.41% of missing data on column country. Meanwhile we are not able to determine whether the percentage of missing value occur in children, as it is extremely small.

raw_data[which(is.na(raw_data$children)),]
##            hotel is_canceled lead_time arrival_date_year arrival_date_month
## 40601 City Hotel           1         2              2015             August
## 40668 City Hotel           1         1              2015             August
## 40680 City Hotel           1         1              2015             August
## 41161 City Hotel           1         8              2015             August
##       arrival_date_week_number arrival_date_day_of_month
## 40601                       32                         3
## 40668                       32                         5
## 40680                       32                         5
## 41161                       33                        13
##       stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 40601                       1                    0      2       NA      0   BB
## 40668                       0                    2      2       NA      0   BB
## 40680                       0                    2      3       NA      0   BB
## 41161                       2                    5      2       NA      0   BB
##       country market_segment distribution_channel is_repeated_guest
## 40601     PRT      Undefined            Undefined                 0
## 40668     PRT         Direct            Undefined                 0
## 40680     PRT      Undefined            Undefined                 0
## 41161     PRT      Online TA            Undefined                 0
##       previous_cancellations previous_bookings_not_canceled reserved_room_type
## 40601                      0                              0                  B
## 40668                      0                              0                  B
## 40680                      0                              0                  B
## 41161                      0                              0                  B
##       assigned_room_type booking_changes deposit_type days_in_waiting_list
## 40601                  B               0   No Deposit                    0
## 40668                  B               0   No Deposit                    0
## 40680                  B               0   No Deposit                    0
## 41161                  B               0   No Deposit                    0
##         customer_type  adr required_car_parking_spaces
## 40601 Transient-Party 12.0                           0
## 40668 Transient-Party 12.0                           0
## 40680 Transient-Party 18.0                           0
## 41161 Transient-Party 76.5                           0
##       total_of_special_requests reservation_status reservation_status_date
## 40601                         1           Canceled              2015-08-01
## 40668                         1           Canceled              2015-08-04
## 40680                         2           Canceled              2015-08-04
## 41161                         1           Canceled              2015-08-09

The above result shows only 4 rows of data with missing values in the children attribute. Therefore we decide to remove the relatively small missing data.

raw_data <- na.omit(raw_data)
print(dim(raw_data))
## [1] 118898     30

Duplicate Data
Lastly any duplicate data is removed from the dataset as can lead to biased performance in the estimation model.

sum(duplicated(raw_data))
## [1] 31984
df <- unique(raw_data)
print(dim(df))
## [1] 86914    30

The cleaned dataset has 86914 rows and 30 attributes.

Exploratory Data Analysis

1. Which type of hotels having higher number of booking within 3 years?

59.37% of people preferred to stay at City Hotel and 40.63% of people preferred to stay at Resort Hotel respectively.

#Calculate the percentage of number of confirmed bookings by type of hotels
totalHotel <- df %>% filter(is_canceled == 0) %>% group_by(hotel) %>% summarize(percentange_hotel = round(n()/nrow(.)*100,2))
#plot pie chart
ggplot(totalHotel, aes(x = "", y = percentange_hotel, fill = hotel)) +
  geom_col() +
  geom_label(aes(label = percentange_hotel),position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(title = "Pecentage of Number of Bookings by Hotel Types", fill = "Hotel")

2. Number of Bookings by year & type of hotel

As referring to bar chart below, we can see that year of 2016 are the highest number of booking.

#filter out cancellation bookings and grouped by year
bar_year <- df %>% filter(is_canceled == 0) %>% group_by(arrival_date_year)
ggplot(bar_year,aes(x=arrival_date_year,fill=hotel)) + geom_bar(position="dodge") +
  xlab("Year") + ylab("Number of Confirmed Bookings") + ggtitle("Number of Confirmed Bookings by Year")

3. Which month having the greatest number of confirmed booking?

July and August are the busiest months for both City Hotel and Resort Hotel. We can foresee that increase in number of bookings due to summer holiday. In addition, January & February and November & December had the lowest number of bookings among a year.

#filter out cancellation bookings and factor the months
bar1 <- df %>% filter(is_canceled == 0) %>% mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>% 
  arrange(arrival_date_month)
#filter cancelalation bookings and calculate total bookings
line1 <- df %>% filter(is_canceled == 0) %>% group_by(arrival_date_month) %>% summarise(count_arrival = n()) %>%
  mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>% 
  arrange(arrival_date_month)
#plot the bar chart and line graph
ggplot() + geom_bar(data=bar1,aes(x=arrival_date_month,fill=hotel),position="dodge") + 
  geom_line(data=line1,aes(x=arrival_date_month,y=count_arrival,group=1)) + 
  geom_point(data=line1,aes(x=arrival_date_month,y=count_arrival,group=1)) +
  xlab("Month of Arrival") + ylab("Number of Confirmed bookings") + ggtitle("Number of Confirmed Bookings by Month")

4. How many cancelled bookings out of total?

Both types of hotel contain large number of cancellation bookings.

#factor those cancellation booking
cancel_booking <- df
cancel_booking$is_canceled <- factor(cancel_booking$is_canceled)
ggplot(data=cancel_booking) + geom_bar(aes(x=hotel,fill=is_canceled),position="dodge") +
  ggtitle("Number of Booking by Hotel")

5. What is the most frequent deposit type for cancelled bookings?

Let’s explore more in cancelled bookings. We can see on below bar chart, “No Deposit” has the highest number of cancelled bookings.

#filter out confirmed cancellation
deposit <- df %>% filter(is_canceled == 1)
ggplot(deposit) + geom_bar(aes(x = deposit_type)) + coord_flip() +
  xlab("Type of Deposits") + ggtitle("Deposit Type for Cancelled Bookings")

6. Top 10 Countries Booking

Let’s see the top 10 countries booking. Most guests are come from Portugal and other countries in Europe.

#filter out cancellation booking and group data by country
countries_booking <- df %>% filter(is_canceled == 0) %>% group_by(country) %>%
  summarise(count_booking = n()) %>% arrange(desc(count_booking))
top10 <- head(countries_booking,10)
#plot bar chart
ggplot(data=top10) + geom_col(aes(x=fct_rev(fct_reorder(country,count_booking)),y=count_booking)) +
  xlab("Country") + ylab("Number of Confirmed Booking") + ggtitle("Top 10 Countries Booking")

7. Price Range per reserved room type

The figure shows that the average price per room depends on its type and standard deviation. We can observe that Type C hotel have the highest standard deviation which means that the price is changed frequently and differently.

#filtered out cancellation bookings
price_reserved <- df %>% filter(is_canceled == 0)
#plot boxplot
ggplot(price_reserved,aes(x = reserved_room_type, y= adr, fill = hotel)) + geom_boxplot() +
  xlab("Type of Reserved Room") + ylab("Average Daily Rate") + ggtitle("Price Range per Reserved Room Type")

8. Average Price of Reserved Room per Months

This plot showed that prices in Resort Hotel are much higher during summer and prices in City Hotel varies less and more expensive in November and December.

#price for city
price_city <- df %>% filter(is_canceled == 0,hotel == "City Hotel") %>%
  select(arrival_date_month, adr) %>%
  group_by(arrival_date_month) %>% summarize(avg_price_city = mean(adr))
#price for resort
price_resort <- df %>% filter(is_canceled == 0,hotel == "Resort Hotel") %>% select(arrival_date_month,adr) %>%
  group_by(arrival_date_month) %>% summarise(avg_price_resort = mean(adr))
#merge two data tgt
final_price <- merge(price_city,price_resort,by="arrival_date_month")
final_price <- final_price %>% mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>% 
  arrange(arrival_date_month)
#plot
ggplot(final_price) + geom_line(aes(x=arrival_date_month,y=avg_price_city,colour = "Average Price of City"),group=1) +
  geom_line(aes(x=arrival_date_month,y=avg_price_resort,colour = "Average Price of Resort"),group=1) +
  xlab("Month") + ylab("Average Price") + ggtitle("Average Price per Month")

9. How long do people stay at the hotels?

People tend to stay both hotels for 1 to 4 days. However, we can see that people also like to stay at Resort Hotel for a week.

#filter out cancellation bookings and calculate total stay of customers
stay_day <- df %>% filter(is_canceled == 0) %>% 
  mutate(total_night = stays_in_weekend_nights + stays_in_week_nights) %>%
  group_by(total_night,hotel) %>% summarize(count_day = n())
## `summarise()` has grouped output by 'total_night'. You can override using the
## `.groups` argument.
total_days <- stay_day %>% filter(total_night < 20)
ggplot(data=total_days) + geom_col(aes(x=total_night,y=count_day,fill=hotel),position="dodge")+
  xlab("Total Night") + ylab("Number of Stays")

10. Which type of the hotel have a higher number of bookings by adults with children & baby?

Both hotels are having less number of bookings by families.

#calculate those adult with children & baby
family <- df %>% filter(is_canceled == 0,children >=1,babies >=1) %>%
  group_by(hotel) %>% summarize(number_booking_F = n())
#calculate number of booking with non-family
non_family <- df %>% filter(is_canceled == 0,children == 0,babies == 0) %>%
  group_by(hotel) %>% summarize(number_booking_NF = n())
#merge them together
total_family <- merge(family,non_family,by="hotel")
total_family <- gather(total_family,"families","n_booking",2:3)
#plot and compare family & non-family
ggplot(total_family) + geom_col(aes(x=hotel,y=n_booking,fill = families),position = "dodge") +
  facet_wrap(~families) + ylab("Number of Confirmed Booking")

Narrow down the number of bookings by adult with children and baby (family). The plot clearly showed that family will tend to stay at Resort Hotel compared to City Hotel.

#narrow down those booking with family
ggplot(family) + geom_col(aes(x=hotel,y=number_booking_F),position = "dodge") +
  ylab("Number of Confirmed Booking")

11. What is guest preferred meal plan?

The preferred meal plan is bed and breakfast (BB) for both hotels. Half board (HB) and Self Catering (SC) are also famous in City Hotel. However, the guest in Resort Hotel preferred bed & breakfast (BB) and Half Board (HB).

#filtered out cancellation bookings and group by hotel & meal
mealss <- df %>% filter(is_canceled == 0) %>% group_by(hotel,meal) %>%
  summarize(count_meals = n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
#plot bar chart
ggplot(mealss) + geom_col(aes(x=meal,y=count_meals,fill=hotel)) + facet_wrap(~hotel) +
  ylab("Number of Booking")

Modeling

Data Preprocessing

Before we fit the data to the machine learning model, we have to select the appropriate attributes. Firstly, we could check the correlation between the target variable (is_canceled) and the other variable.

#Identify the correlation of target variable with other variables
num_col <- df %>% select_if(function(x) is.numeric(x) || is.integer(x))
x <- num_col[1]
y <- num_col[-1]
correlations <- cor(y,x)
correlations <- sort(abs(correlations[,]), decreasing = TRUE)
correlations
##    required_car_parking_spaces                      lead_time 
##                   0.1837511154                   0.1836828359 
##                            adr      total_of_special_requests 
##                   0.1266706709                   0.1218249897 
##                booking_changes              is_repeated_guest 
##                   0.0940744163                   0.0901714750 
##              arrival_date_year           stays_in_week_nights 
##                   0.0877446807                   0.0827781654 
##                         adults                       children 
##                   0.0802155750                   0.0670100420 
##        stays_in_weekend_nights         previous_cancellations 
##                   0.0595960345                   0.0509639808 
## previous_bookings_not_canceled                         babies 
##                   0.0505669732                   0.0209496334 
##      arrival_date_day_of_month           days_in_waiting_list 
##                   0.0052271944                   0.0042967581 
##       arrival_date_week_number 
##                   0.0007948913

From the correlation result, we can identify the appropriate attributes and drop the less important attributes.

num_col1 <- num_col[,c("is_canceled","lead_time","arrival_date_week_number","arrival_date_day_of_month",
                       "stays_in_weekend_nights","stays_in_week_nights","adults","children",
                       "babies","is_repeated_guest", "previous_cancellations",
                       "previous_bookings_not_canceled", "adr", "required_car_parking_spaces",
                       "total_of_special_requests")]

cat_col <- df %>% select_if(is.character)
cat_col1 <- cat_col[,c("hotel","arrival_date_month","meal","market_segment", 
                       "distribution_channel","reserved_room_type","deposit_type","customer_type")]

Next, we will need to change the data type of categorical attributes from character to integer.

#Identify the levels of categorical attributes
cat_col2 <- cat_col1
cat_col2[] <- lapply(cat_col2, factor)
lapply(cat_col2, levels)
## $hotel
## [1] "City Hotel"   "Resort Hotel"
## 
## $arrival_date_month
##  [1] "April"     "August"    "December"  "February"  "January"   "July"     
##  [7] "June"      "March"     "May"       "November"  "October"   "September"
## 
## $meal
## [1] "BB"        "FB"        "HB"        "SC"        "Undefined"
## 
## $market_segment
## [1] "Aviation"      "Complementary" "Corporate"     "Direct"       
## [5] "Groups"        "Offline TA/TO" "Online TA"    
## 
## $distribution_channel
## [1] "Corporate" "Direct"    "GDS"       "TA/TO"     "Undefined"
## 
## $reserved_room_type
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "L" "P"
## 
## $deposit_type
## [1] "No Deposit" "Non Refund" "Refundable"
## 
## $customer_type
## [1] "Contract"        "Group"           "Transient"       "Transient-Party"
#change the data type
cat_col1$hotel <- ifelse(cat_col1$hotel == "City Hotel", 1, 2)
cat_col1$arrival_date_month <- ifelse(cat_col1$arrival_date_month == "January", 1, 
                                      ifelse(cat_col1$arrival_date_month == "February", 2, 
                                             ifelse(cat_col1$arrival_date_month == "March", 3,
                                                    ifelse(cat_col1$arrival_date_month == "April" , 4,
                                                           ifelse(cat_col1$arrival_date_month == "May", 5,
                                                                  ifelse(cat_col1$arrival_date_month == "June", 6,
                                                                         ifelse(cat_col1$arrival_date_month == "July", 7,
                                                                                ifelse(cat_col1$arrival_date_month == "August", 8,
                                                                                       ifelse(cat_col1$arrival_date_month == "September", 9,
                                                                                              ifelse(cat_col1$arrival_date_month == "October", 10,
                                                                                                     ifelse(cat_col1$arrival_date_month == "November", 11,
                                                                                                            ifelse(cat_col1$arrival_date_month == "December", 12, cat_col1$arrival_date_month))))))))))))
cat_col1$meal <- ifelse(cat_col1$meal == "BB", 1, 
                        ifelse(cat_col1$meal == "FB", 2,
                               ifelse(cat_col1$meal == "HB", 3,
                                      ifelse(cat_col1$meal == "SC", 4,
                                             ifelse(cat_col1$meal == "Undefined", 5, cat_col1$meal)))))
cat_col1$market_segment <- ifelse(cat_col1$market_segment == "Aviation", 1,
                                  ifelse(cat_col1$market_segment == "Complementary", 2,
                                         ifelse(cat_col1$market_segment == "Corporate", 3,
                                                ifelse(cat_col1$market_segment == "Direct", 4,
                                                       ifelse(cat_col1$market_segment == "Groups", 5,
                                                              ifelse(cat_col1$market_segment == "Offline TA/TO", 6,
                                                                     ifelse(cat_col1$market_segment == "Online TA", 7,
                                                                            ifelse(cat_col1$market_segment == "Undefined", 8, cat_col1$market_segment))))))))
cat_col1$distribution_channel <- ifelse(cat_col1$distribution_channel == "Corporate", 1,
                                        ifelse(cat_col1$distribution_channel == "Direct", 2,
                                               ifelse(cat_col1$distribution_channel == "GDS", 3,
                                                      ifelse(cat_col1$distribution_channel == "TA/TO", 4,
                                                             ifelse(cat_col1$distribution_channel == "Undefined", 5, cat_col1$distribution_channel)))))
cat_col1$reserved_room_type <- ifelse(cat_col1$reserved_room_type == "A", 1,
                                      ifelse(cat_col1$reserved_room_type == "B", 2,
                                             ifelse(cat_col1$reserved_room_type == "C", 3,
                                                    ifelse(cat_col1$reserved_room_type == "D", 4,
                                                           ifelse(cat_col1$reserved_room_type == "E", 5,
                                                                  ifelse(cat_col1$reserved_room_type == "F", 6,
                                                                         ifelse(cat_col1$reserved_room_type == "G", 7,
                                                                                ifelse(cat_col1$reserved_room_type == "H", 8,
                                                                                       ifelse(cat_col1$reserved_room_type == "L", 9,
                                                                                              ifelse(cat_col1$reserved_room_type == "P", 10, cat_col1$reserved_room_type))))))))))
cat_col1$deposit_type <- ifelse(cat_col1$deposit_type == "No Deposit", 1,
                                ifelse(cat_col1$deposit_type == "Non Refund", 2,
                                       ifelse(cat_col1$deposit_type == "Refundable", 3, cat_col1$deposit_type)))
cat_col1$customer_type <- ifelse(cat_col1$customer_type == "Contract", 1,
                                 ifelse(cat_col1$customer_type == "Group", 2,
                                        ifelse(cat_col1$customer_type == "Transient", 3,
                                               ifelse(cat_col1$customer_type == "Transient-Party", 4, cat_col1$customer_type))))
                                      
cat_col1$arrival_date_month <- as.integer(cat_col1$arrival_date_month)
cat_col1$meal <- as.integer(cat_col1$meal)
cat_col1$market_segment <- as.integer(cat_col1$market_segment)
cat_col1$distribution_channel <- as.integer(cat_col1$distribution_channel)
cat_col1$reserved_room_type <- as.integer(cat_col1$reserved_room_type)
cat_col1$deposit_type <- as.integer(cat_col1$deposit_type)
cat_col1$customer_type <- as.integer(cat_col1$customer_type)

#combine both numerical attributes and categorical attributes
all_col <- cbind(num_col1, cat_col1)

Next, we can normalize the data to improve the performance of machine learning models

#check the variance
var(all_col, all_col$is_canceled)
##                                        [,1]
## is_canceled                     0.199799162
## lead_time                       7.069410459
## arrival_date_week_number        0.004851411
## arrival_date_day_of_month       0.020641887
## stays_in_weekend_nights         0.027406530
## stays_in_week_nights            0.075624715
## adults                          0.022441257
## children                        0.013683857
## babies                         -0.001062937
## is_repeated_guest              -0.007826555
## previous_cancellations          0.008420273
## previous_bookings_not_canceled -0.038834898
## adr                             3.111834316
## required_car_parking_spaces    -0.023033845
## total_of_special_requests      -0.045299525
## hotel                          -0.014988902
## arrival_date_month              0.004203200
## meal                            0.021500465
## market_segment                  0.107992895
## distribution_channel            0.062376650
## reserved_room_type              0.038705480
## deposit_type                    0.007925164
## customer_type                  -0.007358498
#Normalize
all_col$lead_time <- scale(all_col$lead_time)
# all_col$arrival_date_week_number <- scale(all_col$arrival_date_week_number)
# all_col$arrival_date_day_of_month <- scale(all_col$arrival_date_day_of_month)
all_col$adr <- scale(all_col$adr)

#Check the variance
var(all_col, all_col$is_canceled)
##                                        [,1]
## is_canceled                     0.199799162
## lead_time                       0.082104206
## arrival_date_week_number        0.004851411
## arrival_date_day_of_month       0.020641887
## stays_in_weekend_nights         0.027406530
## stays_in_week_nights            0.075624715
## adults                          0.022441257
## children                        0.013683857
## babies                         -0.001062937
## is_repeated_guest              -0.007826555
## previous_cancellations          0.008420273
## previous_bookings_not_canceled -0.038834898
## adr                             0.056620396
## required_car_parking_spaces    -0.023033845
## total_of_special_requests      -0.045299525
## hotel                          -0.014988902
## arrival_date_month              0.004203200
## meal                            0.021500465
## market_segment                  0.107992895
## distribution_channel            0.062376650
## reserved_room_type              0.038705480
## deposit_type                    0.007925164
## customer_type                  -0.007358498

Machine learning model require more time and resources to process large data set, and may result in slower model training and prediction times. Therefore, we will take 20% sample size from the dataset.

#20% sample size from dataset
set.seed(1112)
sample_rows <- sample(1:nrow(all_col), 17383)
all_col1 <- all_col[sample_rows,]

all_col1$is_canceled <- as.factor(all_col1$is_canceled)

str(all_col1)
## 'data.frame':    17383 obs. of  23 variables:
##  $ is_canceled                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
##  $ lead_time                     : num [1:17383, 1] -0.8734 -0.9082 -0.1417 0.0906 -0.9199 ...
##  $ arrival_date_week_number      : int  34 51 11 13 26 16 19 12 53 10 ...
##  $ arrival_date_day_of_month     : int  22 16 11 27 27 14 1 17 30 6 ...
##  $ stays_in_weekend_nights       : int  1 0 0 1 0 0 2 0 1 1 ...
##  $ stays_in_week_nights          : int  1 1 2 2 1 1 1 1 2 2 ...
##  $ adults                        : int  2 2 2 2 1 1 3 1 2 1 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 1 0 0 0 6 ...
##  $ adr                           : num [1:17383, 1] -0.5746 -0.3927 -0.7111 -0.7864 0.0622 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 1 0 1 0 2 0 1 1 ...
##  $ hotel                         : num  1 1 1 1 1 2 1 2 1 2 ...
##  $ arrival_date_month            : int  8 12 3 3 6 4 5 3 12 3 ...
##  $ meal                          : int  1 1 4 4 1 1 1 1 4 1 ...
##  $ market_segment                : int  4 4 7 7 3 3 7 4 7 6 ...
##  $ distribution_channel          : int  2 2 4 4 1 1 4 2 4 4 ...
##  $ reserved_room_type            : int  1 1 1 1 1 1 4 1 1 1 ...
##  $ deposit_type                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ customer_type                 : int  3 3 3 3 3 3 3 3 3 3 ...

Finally, we can fit the data to the machine learning models.

Random Forest

#Random Forest
RF<-function(s, df, col) {
  trainIndex<-createDataPartition(col, p=s, list=F)
  data_train<-df[trainIndex,]
  data_test<-df[-trainIndex,]           
  model <- randomForest(is_canceled~., data=data_train)
  
  # make predictions
  x_test <- data_test[,-1]
  y_test <- data_test[,1]
  predictions <- predict(model, x_test)
  cm<-confusionMatrix(predictions, y_test)
  return(cm)
}
split<-0.80  # 80%/20% train/test
RFresult<-RF(split, all_col1, all_col1$is_canceled)
RFresult
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2340  467
##          1  181  488
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.8002, 0.8264)
##     No Information Rate : 0.7253          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4842          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9282          
##             Specificity : 0.5110          
##          Pos Pred Value : 0.8336          
##          Neg Pred Value : 0.7294          
##              Prevalence : 0.7253          
##          Detection Rate : 0.6732          
##    Detection Prevalence : 0.8075          
##       Balanced Accuracy : 0.7196          
##                                           
##        'Positive' Class : 0               
## 

K-Nearest Neighbours (KNN)

#Import library
library(klaR)
#KNN
KNN<-function(s, df, col) {
  trainIndex<-createDataPartition(col, p=s, list=F)
  data_train<-df[trainIndex,]
  data_test<-df[-trainIndex,]           
  model <- train(is_canceled~., data=data_train, method = "knn")
  
  # make predictions
  x_test <- data_test[,-1]
  y_test <- data_test[,1]
  predictions <- predict(model, x_test)
  cm<-confusionMatrix(predictions, y_test)
  return(list(model, cm))
}
split<-0.80  # 80%/20% train/test
KNNresult<-KNN(split, all_col1, all_col1$is_canceled)
KNNresult
## [[1]]
## k-Nearest Neighbors 
## 
## 13907 samples
##    22 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13907, 13907, 13907, 13907, 13907, 13907, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6684906  0.1209828
##   7  0.6800639  0.1213466
##   9  0.6854045  0.1146752
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
## 
## [[2]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2281  724
##          1  240  231
##                                           
##                Accuracy : 0.7227          
##                  95% CI : (0.7075, 0.7375)
##     No Information Rate : 0.7253          
##     P-Value [Acc > NIR] : 0.6419          
##                                           
##                   Kappa : 0.1741          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.2419          
##          Pos Pred Value : 0.7591          
##          Neg Pred Value : 0.4904          
##              Prevalence : 0.7253          
##          Detection Rate : 0.6562          
##    Detection Prevalence : 0.8645          
##       Balanced Accuracy : 0.5733          
##                                           
##        'Positive' Class : 0               
## 

Decision Tree

DT<-function(s, df, col) {
  trainIndex<-createDataPartition(col, p=s, list=F)
  data_train<-df[trainIndex,]
  data_test<-df[-trainIndex,]           
  model <- ctree(is_canceled~., data=data_train)
  
  # make predictions
  x_test <- data_test[,-1]
  y_test <- data_test[,1]
  predictions <- predict(model, x_test)
  cm<-confusionMatrix(predictions, y_test)
  return(cm)
}
split<-0.80  # 80%/20% train/test
DTresult<-DT(split, all_col1, all_col1$is_canceled)
DTresult
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2296  487
##          1  225  468
##                                           
##                Accuracy : 0.7952          
##                  95% CI : (0.7814, 0.8085)
##     No Information Rate : 0.7253          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4381          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9107          
##             Specificity : 0.4901          
##          Pos Pred Value : 0.8250          
##          Neg Pred Value : 0.6753          
##              Prevalence : 0.7253          
##          Detection Rate : 0.6605          
##    Detection Prevalence : 0.8006          
##       Balanced Accuracy : 0.7004          
##                                           
##        'Positive' Class : 0               
## 

Result Analysis and Discussion 

Evaluation

The following table shows the results of different machine learning algorithms. 

Model Accuracy
Random Forest  81.36% 
k-Nearest Neighbours (kNN)  72.27% 
Decision Tree  79.52% 

As seen above, we can conclude that Random Forest is the best performing algorithm. An accuracy of 81.36% has been achieved, which means that the model is able to generalize well and there is no issue of overfitting or underfitting. 

Final Analysis

Based on the results, we learned that while there are more people who prefer to stay at City Hotel compared to Resort Hotel, those with children tend to book Resort Hotel. This might be due to most Resort Hotel offer up a lot more packages that suitable for children. The packages may include facilities, amenities, activities, and food. 

The data also shows most bookings were made in July and August. Since most guests are coming from Europe countries, we can conclude that the summer months, which runs from June to August, is the peak-season for people in Europe to travel. Consequently, we can see the rise in room prices during this season. 

Bed and Breakfast (BB) which is normally included in the room rate, is the most common meal plan in hotels. For this reason, more guests preferred BB, regardless of the type of hotel. The guests also do not need to worry about preparing breakfast or having to go out in the morning to have breakfast. 

Lastly, based on the correlation result, the top attributes that correlate to booking cancellation are ‘Required Car Parking Spaces’ and ‘Lead Time.’  

Limitation and future work 

This project used data from one single dataset, which raises questions that further study could help explain: 

  • Can the same level of model performance be achieved if more records are added into the dataset? 

  • Can comparable results be obtained from other data sources? 

Thus, additional study, with additional records and different data sources could contribute to a better understanding of the topic. Further research could also make use of attributes from additional data sources such as rating of the hotel, review counts, or currency exchange rate, to improve model performance and measure the correlation of these features with booking cancellation.