| Name | Matric Number |
|---|---|
| Lee Sok Qi | 17098730 |
| Tang Wei Lai | 17143457 |
| Siti Rawiah Binti Mohamad Tarmithi | S2176711 |
| Ili Fadilah Binti Alias | S2156787 |
| Ainul Balqis Binti Sawal | 17196450 |
Hotel sector is in the constantly shifting. In order to enhance pricing, marketing and other business strategies, hotel managers must comprehend the element that affect booking demand. In this analysis, we will explore the Hotel Booking Demand dataset from Kaggle, which contains booking information for a resort hotel and a city hotel and includes information such as when the booking was made, length of stay, the number of adults, and the amount paid. Hotel sector is extremely unpredictable, and reservations depends on a wide range of factors including but not limited to hotel type and seasonality. To better assist the hotel management in making better strategies, it is more crucial to analyse the patterns found in the historical data. The historical data could be utilized by the hotel management to run the number of marketing initiatives designed to boost the businesses.
Our goal is to identify trends and patterns in the data that can provide actionable insights for the hotel management and to predict the likelihood of bookings being cancelled in order to mitigate the revenue loss derived from the booking cancellations
Question 1: Which type of the hotel have a higher number of bookings by adults with children
Question 2: What is the month with the greatest number of guest arrival?
Question 3: What is guest preferred meal plan?
Question 4: Which attributes in the dataset correlates with one another and contributes significantly for the hotel cancellation predictive modelling?
To study the correlation between hotel booking demand versus the variables from the dataset based on the data analysis. The result will be useful in developing a model to predict the likelihood of booking cancellations that is expected to provide earlier information associated with the hotel booking demand.
The dataset contains information on bookings for two hotels in Portugal (a resort and a city hotel) scheduled to arrive in a period between July 1, 2015, and August 31, 2017.
A total of 119,390 observations and 32 variables can be found in this dataset including those that were cancelled. For both hotels, the same information was collected describing 40,060 observations for the resort and 79,330 observations for the city hotel.
The dataset can be accessed from Kaggle: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand?resource=download
| No. | Variables | Description |
|---|---|---|
| 1 | hotel | Type of Hotel whether Resort Hotel or City Hotel |
| 2 | is_canceled | Value indicating if the booking was canceled (1) or not (0) |
| 3 | lead_time | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| 4 | arrival_date_year | Year of arrival date |
| 5 | arrival_date_month | Month of arrival date |
| 6 | arrival_date_week_number | week number of year for arrival date |
| 7 | arrival_date_day_of_month | Day of arrival date |
| 8 | stays_in_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| 9 | stays_in_week_nights | Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| 10 | adults | Number of adults |
| 11 | children | Number of children |
| 12 | babies | Number of babies |
| 13 | meal | Type of meal booked. Undefined/SC – no meal package, BB – Bed & Breakfast, HB – Half board (breakfast and one other meal – usually dinner), FB – Full board (breakfast, lunch and dinner) |
| 14 | country | Country of origin |
| 15 | market_segment | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| 16 | distribution_channel | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| 17 | is_repeated_guest | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| 18 | previous_cancellations | Number of previous bookings that were cancelled by the customer prior to the current booking |
| 19 | previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking |
| 20 | reserved_room_type | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| 21 | assigned_room_type | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. |
| 22 | booking_changes | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| 23 | deposit_type | Type of deposit made for booking: No Deposit – no deposit was made,Non Refund – a deposit was made in the value of the total stay cost, Refundable – a deposit was made with a value under the total cost of stay |
| 24 | agent | ID of the travel agency that made the booking |
| 25 | company | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| 26 | days_in_waiting_list | Number of days the booking was in the waiting list before it was confirmed to the customer |
| 27 | customer_type | Type of booking: Contract - when the booking has an allotment or other type of contract associated to it, Group – when the booking is associated to a group, Transient – when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party – when the booking is transient, but is associated to at least other transient booking |
| 28 | adr | Average Daily Rate (ADR) as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| 29 | required_car_parking_spaces | Number of car parking spaces required by the customer |
| 30 | total_of_special_requests | Number of special requests made by the customer (e.g., twin bed or high floor) |
| 31 | reservation_status | Reservation last status:Canceled – booking was canceled by the customer,Check-Out – customer has checked in but already departed, No-Show – customer did not check-in and did inform the hotel of the reason why |
| 32 | reservation_status_date | Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when the booking was canceled or when did the customer checked-out of the hotel |
The third fundamental step of CRISP DM is Data Preparation. In this section, the processes include identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity.
Load the data from a .csv file
The first step is to import all relevant libraries and import the
dataset.
library(dplyr)
library(stringr)
library(tidyverse)
library(lubridate)
library(caret)
library(randomForest)
library(datasets)
library(caTools)
library(party)
library(magrittr)
library(forcats)
raw_data = read.csv("hotel_bookings.csv")
The directory of the file depends on the dataset’s location saved and it would be different and need to amend accordingly.
Explore the raw data
head(raw_data)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
print(dim(raw_data))
## [1] 119390 32
From the above code, we observed that the dataset has 119390 rows and 32 columns.
Understand Data
print(summary(raw_data))
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
The mean, median, first quartile, third quartile, minimum and maximum value of each columns are shown above.
str(raw_data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Data Conversion
After getting the data type, and data entry samples of each columns we change NULL values to NA. This is because in R programming NULL represents that the value in question simply doesn’t exist, rather than being existent but unknown.
raw_data[raw_data == "NULL"] <- NA
str(raw_data[,c("agent","company")])
## 'data.frame': 119390 obs. of 2 variables:
## $ agent : chr NA NA NA "304" ...
## $ company: chr NA NA NA NA ...
We also would like to change the data type of reservation_status_date to date format.
raw_data$reservation_status_date <- as.Date(raw_data$reservation_status_date)
class(raw_data$reservation_status_date)
## [1] "Date"
Missing Value
We will check if there is any missing values in the dataset
missing_column <-colSums(is.na(raw_data))
print(missing_column)
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 488
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 16340
## company days_in_waiting_list
## 112593 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
There are 4 missing values in children, 488 in country, 16340 in agent, and 112593 in company. Next we find the percentage of missing values in each column.
Find the percentage of missing values in each column
percent_na <- round((missing_column/nrow(raw_data))*100,2)
print(percent_na)
## hotel is_canceled
## 0.00 0.00
## lead_time arrival_date_year
## 0.00 0.00
## arrival_date_month arrival_date_week_number
## 0.00 0.00
## arrival_date_day_of_month stays_in_weekend_nights
## 0.00 0.00
## stays_in_week_nights adults
## 0.00 0.00
## children babies
## 0.00 0.00
## meal country
## 0.00 0.41
## market_segment distribution_channel
## 0.00 0.00
## is_repeated_guest previous_cancellations
## 0.00 0.00
## previous_bookings_not_canceled reserved_room_type
## 0.00 0.00
## assigned_room_type booking_changes
## 0.00 0.00
## deposit_type agent
## 0.00 13.69
## company days_in_waiting_list
## 94.31 0.00
## customer_type adr
## 0.00 0.00
## required_car_parking_spaces total_of_special_requests
## 0.00 0.00
## reservation_status reservation_status_date
## 0.00 0.00
From the above code chunk we obtained the percentage of missing values in each attributes. It is apparent that company has the highest percentage of 94.31% missing values, followed by agent with 13.69% missing values. Hence, we decided to drop both of these attributes.
drop <- c("company","agent")
raw_data <- raw_data[,!names(raw_data) %in% drop]
print(dim(raw_data))
## [1] 119390 30
After removing the columns, there are only 30 attributes left in the dataset.
There are still missing values in children, and country attribute. However from the previous result, we know that there is only 0.41% of missing data on column country. Meanwhile we are not able to determine whether the percentage of missing value occur in children, as it is extremely small.
raw_data[which(is.na(raw_data$children)),]
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 40601 City Hotel 1 2 2015 August
## 40668 City Hotel 1 1 2015 August
## 40680 City Hotel 1 1 2015 August
## 41161 City Hotel 1 8 2015 August
## arrival_date_week_number arrival_date_day_of_month
## 40601 32 3
## 40668 32 5
## 40680 32 5
## 41161 33 13
## stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 40601 1 0 2 NA 0 BB
## 40668 0 2 2 NA 0 BB
## 40680 0 2 3 NA 0 BB
## 41161 2 5 2 NA 0 BB
## country market_segment distribution_channel is_repeated_guest
## 40601 PRT Undefined Undefined 0
## 40668 PRT Direct Undefined 0
## 40680 PRT Undefined Undefined 0
## 41161 PRT Online TA Undefined 0
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## 40601 0 0 B
## 40668 0 0 B
## 40680 0 0 B
## 41161 0 0 B
## assigned_room_type booking_changes deposit_type days_in_waiting_list
## 40601 B 0 No Deposit 0
## 40668 B 0 No Deposit 0
## 40680 B 0 No Deposit 0
## 41161 B 0 No Deposit 0
## customer_type adr required_car_parking_spaces
## 40601 Transient-Party 12.0 0
## 40668 Transient-Party 12.0 0
## 40680 Transient-Party 18.0 0
## 41161 Transient-Party 76.5 0
## total_of_special_requests reservation_status reservation_status_date
## 40601 1 Canceled 2015-08-01
## 40668 1 Canceled 2015-08-04
## 40680 2 Canceled 2015-08-04
## 41161 1 Canceled 2015-08-09
The above result shows only 4 rows of data with missing values in the children attribute. Therefore we decide to remove the relatively small missing data.
raw_data <- na.omit(raw_data)
print(dim(raw_data))
## [1] 118898 30
Duplicate Data
Lastly any duplicate data is removed from the dataset as can lead to
biased performance in the estimation model.
sum(duplicated(raw_data))
## [1] 31984
df <- unique(raw_data)
print(dim(df))
## [1] 86914 30
The cleaned dataset has 86914 rows and 30 attributes.
59.37% of people preferred to stay at City Hotel and 40.63% of people preferred to stay at Resort Hotel respectively.
#Calculate the percentage of number of confirmed bookings by type of hotels
totalHotel <- df %>% filter(is_canceled == 0) %>% group_by(hotel) %>% summarize(percentange_hotel = round(n()/nrow(.)*100,2))
#plot pie chart
ggplot(totalHotel, aes(x = "", y = percentange_hotel, fill = hotel)) +
geom_col() +
geom_label(aes(label = percentange_hotel),position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
labs(title = "Pecentage of Number of Bookings by Hotel Types", fill = "Hotel")
As referring to bar chart below, we can see that year of 2016 are the highest number of booking.
#filter out cancellation bookings and grouped by year
bar_year <- df %>% filter(is_canceled == 0) %>% group_by(arrival_date_year)
ggplot(bar_year,aes(x=arrival_date_year,fill=hotel)) + geom_bar(position="dodge") +
xlab("Year") + ylab("Number of Confirmed Bookings") + ggtitle("Number of Confirmed Bookings by Year")
July and August are the busiest months for both City Hotel and Resort Hotel. We can foresee that increase in number of bookings due to summer holiday. In addition, January & February and November & December had the lowest number of bookings among a year.
#filter out cancellation bookings and factor the months
bar1 <- df %>% filter(is_canceled == 0) %>% mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>%
arrange(arrival_date_month)
#filter cancelalation bookings and calculate total bookings
line1 <- df %>% filter(is_canceled == 0) %>% group_by(arrival_date_month) %>% summarise(count_arrival = n()) %>%
mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>%
arrange(arrival_date_month)
#plot the bar chart and line graph
ggplot() + geom_bar(data=bar1,aes(x=arrival_date_month,fill=hotel),position="dodge") +
geom_line(data=line1,aes(x=arrival_date_month,y=count_arrival,group=1)) +
geom_point(data=line1,aes(x=arrival_date_month,y=count_arrival,group=1)) +
xlab("Month of Arrival") + ylab("Number of Confirmed bookings") + ggtitle("Number of Confirmed Bookings by Month")
Both types of hotel contain large number of cancellation bookings.
#factor those cancellation booking
cancel_booking <- df
cancel_booking$is_canceled <- factor(cancel_booking$is_canceled)
ggplot(data=cancel_booking) + geom_bar(aes(x=hotel,fill=is_canceled),position="dodge") +
ggtitle("Number of Booking by Hotel")
Let’s explore more in cancelled bookings. We can see on below bar chart, “No Deposit” has the highest number of cancelled bookings.
#filter out confirmed cancellation
deposit <- df %>% filter(is_canceled == 1)
ggplot(deposit) + geom_bar(aes(x = deposit_type)) + coord_flip() +
xlab("Type of Deposits") + ggtitle("Deposit Type for Cancelled Bookings")
Let’s see the top 10 countries booking. Most guests are come from Portugal and other countries in Europe.
#filter out cancellation booking and group data by country
countries_booking <- df %>% filter(is_canceled == 0) %>% group_by(country) %>%
summarise(count_booking = n()) %>% arrange(desc(count_booking))
top10 <- head(countries_booking,10)
#plot bar chart
ggplot(data=top10) + geom_col(aes(x=fct_rev(fct_reorder(country,count_booking)),y=count_booking)) +
xlab("Country") + ylab("Number of Confirmed Booking") + ggtitle("Top 10 Countries Booking")
The figure shows that the average price per room depends on its type and standard deviation. We can observe that Type C hotel have the highest standard deviation which means that the price is changed frequently and differently.
#filtered out cancellation bookings
price_reserved <- df %>% filter(is_canceled == 0)
#plot boxplot
ggplot(price_reserved,aes(x = reserved_room_type, y= adr, fill = hotel)) + geom_boxplot() +
xlab("Type of Reserved Room") + ylab("Average Daily Rate") + ggtitle("Price Range per Reserved Room Type")
This plot showed that prices in Resort Hotel are much higher during summer and prices in City Hotel varies less and more expensive in November and December.
#price for city
price_city <- df %>% filter(is_canceled == 0,hotel == "City Hotel") %>%
select(arrival_date_month, adr) %>%
group_by(arrival_date_month) %>% summarize(avg_price_city = mean(adr))
#price for resort
price_resort <- df %>% filter(is_canceled == 0,hotel == "Resort Hotel") %>% select(arrival_date_month,adr) %>%
group_by(arrival_date_month) %>% summarise(avg_price_resort = mean(adr))
#merge two data tgt
final_price <- merge(price_city,price_resort,by="arrival_date_month")
final_price <- final_price %>% mutate(arrival_date_month = factor(arrival_date_month,levels = month.name)) %>%
arrange(arrival_date_month)
#plot
ggplot(final_price) + geom_line(aes(x=arrival_date_month,y=avg_price_city,colour = "Average Price of City"),group=1) +
geom_line(aes(x=arrival_date_month,y=avg_price_resort,colour = "Average Price of Resort"),group=1) +
xlab("Month") + ylab("Average Price") + ggtitle("Average Price per Month")
People tend to stay both hotels for 1 to 4 days. However, we can see that people also like to stay at Resort Hotel for a week.
#filter out cancellation bookings and calculate total stay of customers
stay_day <- df %>% filter(is_canceled == 0) %>%
mutate(total_night = stays_in_weekend_nights + stays_in_week_nights) %>%
group_by(total_night,hotel) %>% summarize(count_day = n())
## `summarise()` has grouped output by 'total_night'. You can override using the
## `.groups` argument.
total_days <- stay_day %>% filter(total_night < 20)
ggplot(data=total_days) + geom_col(aes(x=total_night,y=count_day,fill=hotel),position="dodge")+
xlab("Total Night") + ylab("Number of Stays")
Both hotels are having less number of bookings by families.
#calculate those adult with children & baby
family <- df %>% filter(is_canceled == 0,children >=1,babies >=1) %>%
group_by(hotel) %>% summarize(number_booking_F = n())
#calculate number of booking with non-family
non_family <- df %>% filter(is_canceled == 0,children == 0,babies == 0) %>%
group_by(hotel) %>% summarize(number_booking_NF = n())
#merge them together
total_family <- merge(family,non_family,by="hotel")
total_family <- gather(total_family,"families","n_booking",2:3)
#plot and compare family & non-family
ggplot(total_family) + geom_col(aes(x=hotel,y=n_booking,fill = families),position = "dodge") +
facet_wrap(~families) + ylab("Number of Confirmed Booking")
Narrow down the number of bookings by adult with children and baby (family). The plot clearly showed that family will tend to stay at Resort Hotel compared to City Hotel.
#narrow down those booking with family
ggplot(family) + geom_col(aes(x=hotel,y=number_booking_F),position = "dodge") +
ylab("Number of Confirmed Booking")
The preferred meal plan is bed and breakfast (BB) for both hotels. Half board (HB) and Self Catering (SC) are also famous in City Hotel. However, the guest in Resort Hotel preferred bed & breakfast (BB) and Half Board (HB).
#filtered out cancellation bookings and group by hotel & meal
mealss <- df %>% filter(is_canceled == 0) %>% group_by(hotel,meal) %>%
summarize(count_meals = n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
#plot bar chart
ggplot(mealss) + geom_col(aes(x=meal,y=count_meals,fill=hotel)) + facet_wrap(~hotel) +
ylab("Number of Booking")
Before we fit the data to the machine learning model, we have to select the appropriate attributes. Firstly, we could check the correlation between the target variable (is_canceled) and the other variable.
#Identify the correlation of target variable with other variables
num_col <- df %>% select_if(function(x) is.numeric(x) || is.integer(x))
x <- num_col[1]
y <- num_col[-1]
correlations <- cor(y,x)
correlations <- sort(abs(correlations[,]), decreasing = TRUE)
correlations
## required_car_parking_spaces lead_time
## 0.1837511154 0.1836828359
## adr total_of_special_requests
## 0.1266706709 0.1218249897
## booking_changes is_repeated_guest
## 0.0940744163 0.0901714750
## arrival_date_year stays_in_week_nights
## 0.0877446807 0.0827781654
## adults children
## 0.0802155750 0.0670100420
## stays_in_weekend_nights previous_cancellations
## 0.0595960345 0.0509639808
## previous_bookings_not_canceled babies
## 0.0505669732 0.0209496334
## arrival_date_day_of_month days_in_waiting_list
## 0.0052271944 0.0042967581
## arrival_date_week_number
## 0.0007948913
From the correlation result, we can identify the appropriate attributes and drop the less important attributes.
num_col1 <- num_col[,c("is_canceled","lead_time","arrival_date_week_number","arrival_date_day_of_month",
"stays_in_weekend_nights","stays_in_week_nights","adults","children",
"babies","is_repeated_guest", "previous_cancellations",
"previous_bookings_not_canceled", "adr", "required_car_parking_spaces",
"total_of_special_requests")]
cat_col <- df %>% select_if(is.character)
cat_col1 <- cat_col[,c("hotel","arrival_date_month","meal","market_segment",
"distribution_channel","reserved_room_type","deposit_type","customer_type")]
Next, we will need to change the data type of categorical attributes from character to integer.
#Identify the levels of categorical attributes
cat_col2 <- cat_col1
cat_col2[] <- lapply(cat_col2, factor)
lapply(cat_col2, levels)
## $hotel
## [1] "City Hotel" "Resort Hotel"
##
## $arrival_date_month
## [1] "April" "August" "December" "February" "January" "July"
## [7] "June" "March" "May" "November" "October" "September"
##
## $meal
## [1] "BB" "FB" "HB" "SC" "Undefined"
##
## $market_segment
## [1] "Aviation" "Complementary" "Corporate" "Direct"
## [5] "Groups" "Offline TA/TO" "Online TA"
##
## $distribution_channel
## [1] "Corporate" "Direct" "GDS" "TA/TO" "Undefined"
##
## $reserved_room_type
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "L" "P"
##
## $deposit_type
## [1] "No Deposit" "Non Refund" "Refundable"
##
## $customer_type
## [1] "Contract" "Group" "Transient" "Transient-Party"
#change the data type
cat_col1$hotel <- ifelse(cat_col1$hotel == "City Hotel", 1, 2)
cat_col1$arrival_date_month <- ifelse(cat_col1$arrival_date_month == "January", 1,
ifelse(cat_col1$arrival_date_month == "February", 2,
ifelse(cat_col1$arrival_date_month == "March", 3,
ifelse(cat_col1$arrival_date_month == "April" , 4,
ifelse(cat_col1$arrival_date_month == "May", 5,
ifelse(cat_col1$arrival_date_month == "June", 6,
ifelse(cat_col1$arrival_date_month == "July", 7,
ifelse(cat_col1$arrival_date_month == "August", 8,
ifelse(cat_col1$arrival_date_month == "September", 9,
ifelse(cat_col1$arrival_date_month == "October", 10,
ifelse(cat_col1$arrival_date_month == "November", 11,
ifelse(cat_col1$arrival_date_month == "December", 12, cat_col1$arrival_date_month))))))))))))
cat_col1$meal <- ifelse(cat_col1$meal == "BB", 1,
ifelse(cat_col1$meal == "FB", 2,
ifelse(cat_col1$meal == "HB", 3,
ifelse(cat_col1$meal == "SC", 4,
ifelse(cat_col1$meal == "Undefined", 5, cat_col1$meal)))))
cat_col1$market_segment <- ifelse(cat_col1$market_segment == "Aviation", 1,
ifelse(cat_col1$market_segment == "Complementary", 2,
ifelse(cat_col1$market_segment == "Corporate", 3,
ifelse(cat_col1$market_segment == "Direct", 4,
ifelse(cat_col1$market_segment == "Groups", 5,
ifelse(cat_col1$market_segment == "Offline TA/TO", 6,
ifelse(cat_col1$market_segment == "Online TA", 7,
ifelse(cat_col1$market_segment == "Undefined", 8, cat_col1$market_segment))))))))
cat_col1$distribution_channel <- ifelse(cat_col1$distribution_channel == "Corporate", 1,
ifelse(cat_col1$distribution_channel == "Direct", 2,
ifelse(cat_col1$distribution_channel == "GDS", 3,
ifelse(cat_col1$distribution_channel == "TA/TO", 4,
ifelse(cat_col1$distribution_channel == "Undefined", 5, cat_col1$distribution_channel)))))
cat_col1$reserved_room_type <- ifelse(cat_col1$reserved_room_type == "A", 1,
ifelse(cat_col1$reserved_room_type == "B", 2,
ifelse(cat_col1$reserved_room_type == "C", 3,
ifelse(cat_col1$reserved_room_type == "D", 4,
ifelse(cat_col1$reserved_room_type == "E", 5,
ifelse(cat_col1$reserved_room_type == "F", 6,
ifelse(cat_col1$reserved_room_type == "G", 7,
ifelse(cat_col1$reserved_room_type == "H", 8,
ifelse(cat_col1$reserved_room_type == "L", 9,
ifelse(cat_col1$reserved_room_type == "P", 10, cat_col1$reserved_room_type))))))))))
cat_col1$deposit_type <- ifelse(cat_col1$deposit_type == "No Deposit", 1,
ifelse(cat_col1$deposit_type == "Non Refund", 2,
ifelse(cat_col1$deposit_type == "Refundable", 3, cat_col1$deposit_type)))
cat_col1$customer_type <- ifelse(cat_col1$customer_type == "Contract", 1,
ifelse(cat_col1$customer_type == "Group", 2,
ifelse(cat_col1$customer_type == "Transient", 3,
ifelse(cat_col1$customer_type == "Transient-Party", 4, cat_col1$customer_type))))
cat_col1$arrival_date_month <- as.integer(cat_col1$arrival_date_month)
cat_col1$meal <- as.integer(cat_col1$meal)
cat_col1$market_segment <- as.integer(cat_col1$market_segment)
cat_col1$distribution_channel <- as.integer(cat_col1$distribution_channel)
cat_col1$reserved_room_type <- as.integer(cat_col1$reserved_room_type)
cat_col1$deposit_type <- as.integer(cat_col1$deposit_type)
cat_col1$customer_type <- as.integer(cat_col1$customer_type)
#combine both numerical attributes and categorical attributes
all_col <- cbind(num_col1, cat_col1)
Next, we can normalize the data to improve the performance of machine learning models
#check the variance
var(all_col, all_col$is_canceled)
## [,1]
## is_canceled 0.199799162
## lead_time 7.069410459
## arrival_date_week_number 0.004851411
## arrival_date_day_of_month 0.020641887
## stays_in_weekend_nights 0.027406530
## stays_in_week_nights 0.075624715
## adults 0.022441257
## children 0.013683857
## babies -0.001062937
## is_repeated_guest -0.007826555
## previous_cancellations 0.008420273
## previous_bookings_not_canceled -0.038834898
## adr 3.111834316
## required_car_parking_spaces -0.023033845
## total_of_special_requests -0.045299525
## hotel -0.014988902
## arrival_date_month 0.004203200
## meal 0.021500465
## market_segment 0.107992895
## distribution_channel 0.062376650
## reserved_room_type 0.038705480
## deposit_type 0.007925164
## customer_type -0.007358498
#Normalize
all_col$lead_time <- scale(all_col$lead_time)
# all_col$arrival_date_week_number <- scale(all_col$arrival_date_week_number)
# all_col$arrival_date_day_of_month <- scale(all_col$arrival_date_day_of_month)
all_col$adr <- scale(all_col$adr)
#Check the variance
var(all_col, all_col$is_canceled)
## [,1]
## is_canceled 0.199799162
## lead_time 0.082104206
## arrival_date_week_number 0.004851411
## arrival_date_day_of_month 0.020641887
## stays_in_weekend_nights 0.027406530
## stays_in_week_nights 0.075624715
## adults 0.022441257
## children 0.013683857
## babies -0.001062937
## is_repeated_guest -0.007826555
## previous_cancellations 0.008420273
## previous_bookings_not_canceled -0.038834898
## adr 0.056620396
## required_car_parking_spaces -0.023033845
## total_of_special_requests -0.045299525
## hotel -0.014988902
## arrival_date_month 0.004203200
## meal 0.021500465
## market_segment 0.107992895
## distribution_channel 0.062376650
## reserved_room_type 0.038705480
## deposit_type 0.007925164
## customer_type -0.007358498
Machine learning model require more time and resources to process large data set, and may result in slower model training and prediction times. Therefore, we will take 20% sample size from the dataset.
#20% sample size from dataset
set.seed(1112)
sample_rows <- sample(1:nrow(all_col), 17383)
all_col1 <- all_col[sample_rows,]
all_col1$is_canceled <- as.factor(all_col1$is_canceled)
str(all_col1)
## 'data.frame': 17383 obs. of 23 variables:
## $ is_canceled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
## $ lead_time : num [1:17383, 1] -0.8734 -0.9082 -0.1417 0.0906 -0.9199 ...
## $ arrival_date_week_number : int 34 51 11 13 26 16 19 12 53 10 ...
## $ arrival_date_day_of_month : int 22 16 11 27 27 14 1 17 30 6 ...
## $ stays_in_weekend_nights : int 1 0 0 1 0 0 2 0 1 1 ...
## $ stays_in_week_nights : int 1 1 2 2 1 1 1 1 2 2 ...
## $ adults : int 2 2 2 2 1 1 3 1 2 1 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : int 0 0 0 0 0 1 0 0 0 1 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 1 0 0 0 6 ...
## $ adr : num [1:17383, 1] -0.5746 -0.3927 -0.7111 -0.7864 0.0622 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 1 0 1 0 2 0 1 1 ...
## $ hotel : num 1 1 1 1 1 2 1 2 1 2 ...
## $ arrival_date_month : int 8 12 3 3 6 4 5 3 12 3 ...
## $ meal : int 1 1 4 4 1 1 1 1 4 1 ...
## $ market_segment : int 4 4 7 7 3 3 7 4 7 6 ...
## $ distribution_channel : int 2 2 4 4 1 1 4 2 4 4 ...
## $ reserved_room_type : int 1 1 1 1 1 1 4 1 1 1 ...
## $ deposit_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ customer_type : int 3 3 3 3 3 3 3 3 3 3 ...
Finally, we can fit the data to the machine learning models.
#Random Forest
RF<-function(s, df, col) {
trainIndex<-createDataPartition(col, p=s, list=F)
data_train<-df[trainIndex,]
data_test<-df[-trainIndex,]
model <- randomForest(is_canceled~., data=data_train)
# make predictions
x_test <- data_test[,-1]
y_test <- data_test[,1]
predictions <- predict(model, x_test)
cm<-confusionMatrix(predictions, y_test)
return(cm)
}
split<-0.80 # 80%/20% train/test
RFresult<-RF(split, all_col1, all_col1$is_canceled)
RFresult
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2340 467
## 1 181 488
##
## Accuracy : 0.8136
## 95% CI : (0.8002, 0.8264)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4842
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9282
## Specificity : 0.5110
## Pos Pred Value : 0.8336
## Neg Pred Value : 0.7294
## Prevalence : 0.7253
## Detection Rate : 0.6732
## Detection Prevalence : 0.8075
## Balanced Accuracy : 0.7196
##
## 'Positive' Class : 0
##
#Import library
library(klaR)
#KNN
KNN<-function(s, df, col) {
trainIndex<-createDataPartition(col, p=s, list=F)
data_train<-df[trainIndex,]
data_test<-df[-trainIndex,]
model <- train(is_canceled~., data=data_train, method = "knn")
# make predictions
x_test <- data_test[,-1]
y_test <- data_test[,1]
predictions <- predict(model, x_test)
cm<-confusionMatrix(predictions, y_test)
return(list(model, cm))
}
split<-0.80 # 80%/20% train/test
KNNresult<-KNN(split, all_col1, all_col1$is_canceled)
KNNresult
## [[1]]
## k-Nearest Neighbors
##
## 13907 samples
## 22 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13907, 13907, 13907, 13907, 13907, 13907, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6684906 0.1209828
## 7 0.6800639 0.1213466
## 9 0.6854045 0.1146752
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
##
## [[2]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2281 724
## 1 240 231
##
## Accuracy : 0.7227
## 95% CI : (0.7075, 0.7375)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : 0.6419
##
## Kappa : 0.1741
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9048
## Specificity : 0.2419
## Pos Pred Value : 0.7591
## Neg Pred Value : 0.4904
## Prevalence : 0.7253
## Detection Rate : 0.6562
## Detection Prevalence : 0.8645
## Balanced Accuracy : 0.5733
##
## 'Positive' Class : 0
##
DT<-function(s, df, col) {
trainIndex<-createDataPartition(col, p=s, list=F)
data_train<-df[trainIndex,]
data_test<-df[-trainIndex,]
model <- ctree(is_canceled~., data=data_train)
# make predictions
x_test <- data_test[,-1]
y_test <- data_test[,1]
predictions <- predict(model, x_test)
cm<-confusionMatrix(predictions, y_test)
return(cm)
}
split<-0.80 # 80%/20% train/test
DTresult<-DT(split, all_col1, all_col1$is_canceled)
DTresult
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2296 487
## 1 225 468
##
## Accuracy : 0.7952
## 95% CI : (0.7814, 0.8085)
## No Information Rate : 0.7253
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4381
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9107
## Specificity : 0.4901
## Pos Pred Value : 0.8250
## Neg Pred Value : 0.6753
## Prevalence : 0.7253
## Detection Rate : 0.6605
## Detection Prevalence : 0.8006
## Balanced Accuracy : 0.7004
##
## 'Positive' Class : 0
##
The following table shows the results of different machine learning algorithms.
| Model | Accuracy |
|---|---|
| Random Forest | 81.36% |
| k-Nearest Neighbours (kNN) | 72.27% |
| Decision Tree | 79.52% |
As seen above, we can conclude that Random Forest is the best performing algorithm. An accuracy of 81.36% has been achieved, which means that the model is able to generalize well and there is no issue of overfitting or underfitting.
Based on the results, we learned that while there are more people who prefer to stay at City Hotel compared to Resort Hotel, those with children tend to book Resort Hotel. This might be due to most Resort Hotel offer up a lot more packages that suitable for children. The packages may include facilities, amenities, activities, and food.
The data also shows most bookings were made in July and August. Since most guests are coming from Europe countries, we can conclude that the summer months, which runs from June to August, is the peak-season for people in Europe to travel. Consequently, we can see the rise in room prices during this season.
Bed and Breakfast (BB) which is normally included in the room rate, is the most common meal plan in hotels. For this reason, more guests preferred BB, regardless of the type of hotel. The guests also do not need to worry about preparing breakfast or having to go out in the morning to have breakfast.
Lastly, based on the correlation result, the top attributes that correlate to booking cancellation are ‘Required Car Parking Spaces’ and ‘Lead Time.’
This project used data from one single dataset, which raises questions that further study could help explain:
Can the same level of model performance be achieved if more records are added into the dataset?
Can comparable results be obtained from other data sources?
Thus, additional study, with additional records and different data sources could contribute to a better understanding of the topic. Further research could also make use of attributes from additional data sources such as rating of the hotel, review counts, or currency exchange rate, to improve model performance and measure the correlation of these features with booking cancellation.