As one of the most important components on our travelling agenda, securing a hotel room in a strange city or even country ahead of our travelling seem to be one of the first things to hit on our to-do list. Thanks to the technologies and various hotel booking platforms, travelers nowadays are able to book or cancel hotel reservations in just a blink of an eye. Along with the convinience of making hotel reservations, the cancellation of a room reservation has also become much easier. As a result, the potential loss due to hotel cancellation might post additional cost to hotel management in general. In other words, to have a good sense about how cancellation activities look like has become increasingly important for hotels for their daily management as well as long-term business decision making. For this final project, I used the data collected by hotel booking platforms regarding different attributes of the reservation to better understand hotel cancellation in general, then try to answer questions in related to hotel cancellation such as what factors are closely associated with hotel cancellation, and how can we better predict hotel cancellation. With these answers and information, the hotel management team would be better informed with the following aspects:
To answer these questions, I will
The main R package used for this project is tidyverse. The core functions inbeded in tidyverse include ggplot2, tidyr, dplyr etc. Some non-core functions of the tidyverse package were also loaded.
library(tidyverse)
library(readxl)
library(lubridate)
library(magrittr)
library(glue)
library(tidyverse)
The hotels dataset used for this project was downloaded from the BANA 7025 Course Website. A total number of 32 varibales are included in the hotel dataset, most of which are categorical variables. Detailed variable name, type, and descriptions are listed below.
hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled double Value indicating if the booking was canceled (1) or not (0)
lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year double Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number double Week number of year for arrival date
arrival_date_day_of_month double Day of arrival date
stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults double Number of adults
children double Number of children
babies double Number of babies
country character Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
agent character ID of the travel agency that made the booking
company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer
adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces double Number of car parking spaces required by the customer
total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)
As you can see, it is a rather long variable list, and it might be not easy to digest. To have a better idea about the variables, I grouped them into four general categories:
adults, children, babies, country, is_repeated_guest, previous_cancellations, previous_bookings_not_cancelled, agent, company, customer_type, and total_of_special_requests
hotel, meal, reserved_room_type, assigned_room_type, adr, and required_car_parking_spaces
lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list, and reservation_status_date
is_canceled, and reservation_status
Now let's take a closer look at the dataset.
#load dataset.
hotels <- read.csv("hotels.csv", header = TRUE)
str(hotels)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
head(hotels, 5)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
As the output above shows, there are a total of 119390 observations and 32 variables in the dataset.
To get an idea about missing data in the dataset, the is.na fuction in R programming language was used as shown below.
# missing data in general
sum(is.na(hotels))
## [1] 4
# missing data broken down by variable
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
The output above indicates that the hotels dataset is a fairly complete dataset, with only 4 cases of missing data for children. To further check for missing values that might be labeled by other expressions in the dataset (e.g., -99, 99, NULL, etc.), unique values for each variable were extracted using the R fuction table(data$variable). Due to the massive amount of values and observations, the output of this step will not be shown. Instead, the results will be synthesized in the form of narrative below.
table(hotels$is_canceled)
table(hotels$lead_time)
table(hotels$arrival_date_year)
table(hotels$arrival_date_month)
table(hotels$arrival_date_week_number)
table(hotels$arrival_date_day_of_month)
table(hotels$stays_in_weekend_nights)
table(hotels$stays_in_week_nights)
table(hotels$adults)
table(hotels$children)
table(hotels$babies)
table(hotels$meal)
table(hotels$country)
table(hotels$market_segment)
table(hotels$distribution_channel)
table(hotels$is_repeated_guest)
table(hotels$previous_cancellations)
table(hotels$previous_bookings_not_canceled)
table(hotels$booking_changes)
table(hotels$deposit_type)
table(hotels$agent)
table(hotels$company)
table(hotels$days_in_waiting_list)
table(hotels$customer_type)
table(hotels$adr)
table(hotels$required_car_parking_spaces)
table(hotels$total_of_special_requests)
table(hotels$reservation_status)
table(hotels$reservation_status_date)
Now let's take a closer look at two variavles agent and company which seem to have more NULL values than the others.
table(hotels$agent)
table(hotels$company)
The two lines of code above showed us that for variable agent and company, there were a large amount of cases being labeled as "NULL" (n=16340, 112593 respectively), which indicated missing data . (output now shown due to its large amount) To replace the "NULL" values with NA for variable agent and company, the codes below was used.
hotels$agent[hotels$agent %in% c("NULL")] <- "NA"
hotels$company[hotels$company %in% c("NULL")] <- "NA"
# QA
table(hotels$agent)
table(hotels$company)
Dupicate rows can be problematic and cause issues for data analysis and interpretation. Therefore in this step, I first identified duplicate rows (if any), and then removed them. The code and output below demonstrate this step.
# check the total number of duplicates in the dataset
sum(duplicated(hotels))
## [1] 31994
# remove duplicates
hotels_no_dup <- hotels[!duplicated(hotels), ]
# QA
sum(duplicated(hotels_no_dup))
## [1] 0
As the output above shows, there were 31994 duplicate rows in the hotel dataset. After removing the duplicate rows, the new dataset was saved as hotels_no_dup, and no more duplicate rows remianed in the new dataset.
Of the 32 variables, variable lead_timeand adr are more understandable as numeric, so summary statistics were computed as shown below. Other categorical variables were demonstrated in the form of barplots.
summary(hotels_no_dup$lead_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 49.00 79.89 125.00 737.00
summary(hotels_no_dup$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 72.00 98.10 106.34 134.00 5400.00
The minimum value for variable adr seems to be invalid as the averaged hotel rate should not be negative in most situations. Examined the dataset again (with the table(dataset$variable function), there was one value of "-6.38" in the dataset, along with a number of cases with very low adr values such as "0", "0.26" and "0.5" etc. The existance of various low adr values suggests that the case of "-6.38" might not be a misentering or invalid data. Instead, there must be a systematic reason for those low adr values such as gift cards or credits being redeamed or hotel compensation. In a word, without conforming with the client first, all those low values in adr are still considered to be valid and thus no action is needed now.
Barplot was used to visualize frequencies for the categorical variables. The code and output are shown below.
barplot(table(hotels_no_dup$hotel), xlab = "Hotel Type")
barplot(table(hotels_no_dup$is_canceled), xlab = "Is the Reservation Cancelled")
barplot(table(hotels_no_dup$arrival_date_year), xlab = "Arrival Year")
barplot(table(hotels_no_dup$arrival_date_month), xlab = "Arrival Month")
barplot(table(hotels_no_dup$arrival_date_week_number), xlab = "Arrival Week") # not so useful
barplot(table(hotels_no_dup$arrival_date_day_of_month), xlab = "Arrival Day")
barplot(table(hotels_no_dup$stays_in_weekend_nights), xlab = "Weekend Night Stayed")
barplot(table(hotels_no_dup$stays_in_week_nights), xlab = "Week Night Stayed")
barplot(table(hotels_no_dup$adults), xlab = "Number of Adults")
barplot(table(hotels_no_dup$children), xlab = "Number of Children")
barplot(table(hotels_no_dup$babies), xlab = "Number of Babies")
barplot(table(hotels_no_dup$meal), xlab = "Type of Meal Booked")
barplot(table(hotels_no_dup$country), xlab = "Customer Country")
barplot(table(hotels_no_dup$market_segment), xlab = "Market Segment")
barplot(table(hotels_no_dup$distribution_channel), xlab = "Booking Distribution Channel")
barplot(table(hotels_no_dup$is_repeated_guest), xlab = "Is Repeated Guest")
barplot(table(hotels_no_dup$previous_cancellations), xlab = "Previous Bookings Cancelled")
barplot(table(hotels_no_dup$previous_bookings_not_canceled), xlab = "No. of Previous Bookings Not Cancelled")
barplot(table(hotels_no_dup$reserved_room_type), xlab = "Reserved Room Type")
barplot(table(hotels_no_dup$assigned_room_type), xlab = "Assigned Room Type")
barplot(table(hotels_no_dup$booking_changes), xlab = "Number of Changes Made")
barplot(table(hotels_no_dup$deposit_type), xlab = "Deposit Type")
barplot(table(hotels_no_dup$agent), xlab = "Travel Agent ID")
barplot(table(hotels_no_dup$company), xlab = "Company ID")
barplot(table(hotels_no_dup$days_in_waiting_list), xlab = "Days in Waitlist")
barplot(table(hotels_no_dup$customer_type), xlab = "Customer Type")
barplot(table(hotels_no_dup$required_car_parking_spaces), xlab = "Number of Required Parking Spaces")
barplot(table(hotels_no_dup$total_of_special_requests), xlab = "Number of Special Requests")
barplot(table(hotels_no_dup$reservation_status), xlab = "Reservation Last Status")
barplot(table(hotels_no_dup$reservation_status_date), xlab = "Last Status Date") # not so useful
The barplots above have given us a much better sense in terms of what the data tells us about the aspects being measured. Specifically, the list below summarizes some notable patterns by just eyeballing the dataset.
hotel hotel type
is_canceled Is the reservation cancelled?
lead_time number of days between booking and arrival
arrival_date_year arrival year
arrival_date_month arrival month
arrival_date_week_number arrival week
arrival_date_day_of_month arrival day
stays_in_weekend_nights number of weekend nights stayed
stays_in_week_nights number of week nights stayed
adults number of adults
children number of children
babies number of babies
meal type of meal booked
country guest country of origin
market_segment market segment
distribution_channel distribution channel
is_repeated_guest is repeated guest
previous_cancellations number of previous cancellations
previous_bookings_not_canceled Number of previous bookings not canceled
reserved_room_type room type reserved
assigned_room_type room type assigned
booking_changes number of changes made to the booking
deposit_type deposit type
agent ID of the travel agency that made the booking
company ID of the company/entity that made the booking or responsible for paying the booking
days_in_waiting_list number of days in waitlist
customer_type type of booking
adr average daily rate
required_car_parking_spaces number of parking spaces required
total_of_special_requests number of special requests made
reservation_status reservation last status
reservation_status_date reservation last status date
Now after replacing the missing value, removing duplicates, checking for data distribution and invalid values, the datset hotels_no_dup is considered clean, and it was renamed as hotels_final. The final dataset has 87396 rows and 32 variables, and the first 10 rows showed below.
# rename the final cleaned dataset
hotels_final <- hotels_no_dup
# final data snapshot
head(hotels_final, 5)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
str(hotels_final)
## 'data.frame': 87396 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 1 1 1 ...
## $ lead_time : int 342 737 7 13 14 0 9 85 75 23 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 3 3 4 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 0 1 1 0 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
In this section, exploratary data analysis was performed for the final dataset hotels_final. My plan of doing EDA is outlined below:
is_canceled and reservation_status with other variables using x-y plotsDue to the nature that this dataset is consist of categorical variables primarily, in order to achieve my analysis goals, I will need to learn more about data manipulation in terms of fitting my categorical variables into a regression model.