Akash Khanvilkar , Guna Shanmukha Gedela , Sindhu Herle
Hospitality generates revenue for local economies directly when tourists spend money in hotels, restaurants and entertainment venues. Hospitality industry is growing, with more and more people spending their money for vacation and leisure activities. People may only lodge into a hotel when it’s a holiday season or a special event, thus the demand for staying room is not equally distributed across the year. Hotel industry is a very volatile industry and the bookings depend on variety of factors such as type of hotels, seasonality, days of week and many more. This makes analyzing the patterns available in the past data more important to help the hotels plan better. Using the historical data, hotels can perform various campaigns to boost the business. The data consists of around 119,390 booking transactions from 2 hotel: an anonymous city hotel from Lisbon and a resort hotel from Algarve. The dataset comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. There is so much to explore from this data, but we will only focus on demand forecasting.
We will tackle this problem statement in three segments: 1. We will be analyzing some key metrics for hotel bookings like: Most preferred meal types, country wise bookings, new customers acquired, the number of cancellations, number of bookings on weekday vs weekends, customer lifetime value of the existing customers, type of rooms preferred by customers, booking types, hotels available for booking and the revenue of the hotels 2. Further, we will be using various angles to look through the data to analyze patterns associated with each segment such as: The type of hotel, day of week, type of customers,type of rooms 3. Furthermore, we will also try to predict the future bookings either based on time series analysis or decision trees.
The project aims to gain interesting insight into customers’ behavior when booking a hotel. To maximize the revenue gained by the hotel, the management often employed a pricing strategy, one of them being raising the room rate when the demand is high and making a promo when the demand is low. Thus, the ability to accurately forecast the future demand is very important and became a vital part on the pricing scheme. The demand for different segment of customer may differ and forecasting become harder as it may requires different model for different segment.These insights can guide hotels to adjust their customer strategies and make preparation for unknown.
Rpart, rpart.plot and ROCR : These packages are used for building classification and regression models using decision trees. Further, we can visualize the tree structure and evaluate the performance of the models.
Tidyverse : This package consists of 6 core packages out of which the below 3 are most important for this project: dplyr: Used for data manipulation tidyr: Used for data modifications ggplot2: Used for creating powerful visualizations.
Forecast, tseries and sarima : These packages are used to model the time-series data including the seasonal component in the series (if any).
#Importing the packages required
library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(tidyverse) #used for data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#library("forecast") #used for time-series forecasting
library("tseries") #used for time-series forecasting
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library("sarima") #used for time-series forecasting including seasonal components
## Loading required package: stats4
The data set contains the following variables:
Data structure
This data set contains 119390 observations and 32 variables. Some of the variables are integer, others are character. It includes information about hotel names, lead time, number of children, agent, and much more.
#Importing the dataset
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hotels)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
#Structure of the dataset
str(hotels)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
Missing value By checking missing values for each variables, we can see that there are only 4 missing value for variable “children”, no missing value in other variables.
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
Taking care of the missing value
hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)
Total bookings across years
hotels %>%
group_by(is_repeated_guest) %>%
count(is_repeated_guest) %>%
summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 × 3
## is_repeated_guest percent total
## <dbl> <dbl> <int>
## 1 0 96.8 115580
## 2 1 3.19 3810
Leading time difference between two groups of guests
guestbehabar <- function(behavior){
hotels %>%
ggplot(aes(is_repeated_guest, fill = behavior)) +
geom_bar(position = "fill") +
labs(title = "Behavior feature by guest type",
subtitle = behavior,
x = "Guest type (1 for repeated guests)",
y = "Percentage of group") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
}
behaviourboxplot <- function(behavior){
hotels %>%
ggplot(aes(x = is_repeated_guest, y = behavior)) +
geom_boxplot() +
geom_jitter(width = .15, alpha = .2) +
labs(title = "Behavior feature by guest type",
subtitle = behavior,
x = "Guest type (1 for repeated guests)",
y = "count") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
}
behaviourboxplot(hotels$lead_time)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
Type of hotel booked by repeated guests
guestbehabar(hotels$hotel)
It turns out that repeated guests prefer City Hotel other than Resort Hotel. However, the percentage of repeated guests that book Resort Hotel is 46.7%, while only 33.1% of unrepeated guests book Resort Hotel.
hotels %>%
group_by(is_repeated_guest) %>%
filter(hotel == "Resort Hotel") %>%
count() -> filhot
hotels %>%
group_by(is_repeated_guest) %>%
count() -> total
as.data.frame(filhot/total)
## is_repeated_guest n
## 1 NaN 0.3312165
## 2 1 0.4666667
Repeated guests cancelling booking
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)
guestbehabar(hotels$is_canceled)
hotels %>%
group_by(is_repeated_guest) %>%
filter(is_canceled == "1") %>%
count() -> filcan
hotels %>%
group_by(is_repeated_guest) %>%
count() -> total
as.data.frame(filcan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.3778508
## 2 NA 0.1448819
Meal choice of repeated guests:
guestbehabar(hotels$meal)
table(hotels$meal, hotels$is_repeated_guest)
##
## 0 1
## BB 88837 3473
## FB 789 9
## HB 14277 186
## SC 10540 110
## Undefined 1137 32
hotels %>%
group_by(is_repeated_guest) %>%
filter(meal == "BB") %>%
count() -> filmea
as.data.frame(filmea/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7686191
## 2 NA 0.9115486
Percentage of repeated guests making deposit
guestbehabar(hotels$deposit_type)
Normally no. 98.2% of repeated guests don’t make deposit. A possiable reason is that they are reliable guests, so that they don’t need to make deposit. The percentage for unrepeated guests is 87.3%.
hotels %>%
group_by(is_repeated_guest) %>%
filter(deposit_type == "No Deposit") %>%
count() -> fildep
as.data.frame(fildep/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.8729798
## 2 NA 0.9821522
Customer type of repeated guests
guestbehabar(hotels$customer_type)
80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking. 4.2% of the repeated guests make bookings that are associated to a group. Both of these two numbers are higher than that of unrepeated guests making the bookings.
custype <- function(type){
hotels %>%
group_by(is_repeated_guest) %>%
filter(customer_type == type) %>%
count()
}
custype("Transient") -> filtra
custype("Group") -> filgro
as.data.frame(filtra/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7487455
## 2 NA 0.8065617
The probability of a repeated guest cancel a booking is much lower than that of a unrepeated guest does. This informs that repeated guests are much loyal.
81.1% of repeated guests don’t make change of booking.
98.2% of repeated guests don’t make deposit. Higher than unrepeated guests do.
80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking.
Only 3.19% of the guests are repeated guests.
Repeated guests tend to book the hotel one month ahead of visiting, which is much shorter than that made by unrepeated guests. This indicates repeated guests don’t rush to book hotels may because they always know which hotel to book if visiting that place.