Hotel industry is a very volatile industry and the bookings depend on variety of factors such as type of hotels, seasonality, days of week and many more. This makes analyzing the patterns available in the past data more important to help the hotels plan better. Using the historical data, hotels can perform various campaigns to boost the business. We can use the patterns to predict the future bookings using time series or decision trees.
We will be using the data available to analyze the factors affecting the hotel bookings. These factors can be used for reporting the trends and predict the future bookings.
We will be tackling this problem statement in three stages:
Using the results from the above analysis, business can make key decisions regarding the customer experience they desire to deliver.
rpart, rpart.plot and ROCR : These packages are used for building classification and regression models using decision trees. Further, we can visualize the tree structure and evaluate the performance of the models
forecast, tseries and sarima : These packages are used to model the time-series data including the seasonal component in the series if any
library(tidyverse) #used for data manipulation
library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates
library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(ROCR) #used for evaluating tree performance
library("forecast") #used for time-series forecasting
library("tseries") #used for time-series forecasting
library("sarima") #used for time-series forecasting including seasonal components
The data has been source from TidyTuesday website, which is a home for various amazing R projects.
The data originally comes from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019
Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields. The data was extracted from hotels’ Property Management System (PMS) SQL databases Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.
Below is the data flow diagram used for collecting the data:
hotel.data <- read.csv("G:/UCinn Course Work/Sem2/R/Final Project/hotels/hotels.csv")
The total number of rows in data is 119390. The dataset has 32 columns
Viewing the raw data
knitr::kable(head(hotel.data,n = 10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
Data type of each variable with sample values:
str(hotel.data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
is_canceled and is_repeated_guest.hotel column are 0, 1is_repeated_guest column are 0, 1hotel.data$is_canceled <- as.factor(hotel.data$is_canceled)
hotel.data$is_repeated_guest <- as.factor(hotel.data$is_repeated_guest)
Checking the missing values in each column
Below are the missing values in each column:
colSums(is.na(hotel.data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
hotel.data$children[is.na(hotel.data$children)] <- 0
sum(is.na(hotel.data))
## [1] 0
children. We can impute these values by 0.Below is the summary statistics of each variable:
summary(hotel.data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 0:75166 Min. : 0 Min. :2015
## Resort Hotel:40060 1:44224 1st Qu.: 18 1st Qu.:2016
## Median : 69 Median :2016
## Mean :104 Mean :2016
## 3rd Qu.:160 3rd Qu.:2017
## Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 0:115580
## Offline TA/TO:24219 Direct :14645 1: 3810
## Groups :19811 GDS : 193
## Direct :12606 TA/TO :97870
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
Observations:
City Hotel has 79330 number of records and Resort Hotel has 40060 number of recordsAug has maximum number of bookingsUndefined channelA rooms are preferred in majority of the bookingsagent and company column are text stored as Nulladr column has negative values which must be investigatedarrival date and reservation_status_dateHistogram of Yearly Bookings
hist(hotel.data$arrival_date_year, prob = T, col = "yellow", breaks = 10, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")
Majority bookings can be seen in 2016
Histogram of Avg. Daily Price
hist(hotel.data$adr, prob = T, col = "grey", breaks = 100,xlim = c(min(hotel.data$adr) - 1,max(hotel.data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
There seem some outliers in the adr column
Below are the plots of various numeric variables:
boxplot(hotel.data$stays_in_weekend_nights,main = "Boxplot : Weekend Nights")
boxplot(hotel.data$stays_in_week_nights,main = "Boxplot : Weekday Nights")
boxplot(hotel.data$adults,main = "Boxplot : # of Adults")
boxplot(hotel.data$children,main = "Boxplot : # of Children")
boxplot(hotel.data$babies,main = "Boxplot : # of Babies")
boxplot(hotel.data$previous_cancellations,main = "Boxplot : # of Previous Cancellations")
boxplot(hotel.data$previous_bookings_not_canceled,main = "Boxplot : # of Previous Non-Cancellations")
boxplot(hotel.data$booking_changes,main = "Boxplot : # of Booking Changes")
boxplot(hotel.data$days_in_waiting_list,main = "Boxplot : Days in Waiting List")
boxplot(hotel.data$adr,main = "Boxplot : Avg. Daily Price")
boxplot(hotel.data$required_car_parking_spaces,main = "Boxplot : # of Car Parking Spaces Req.")
boxplot(hotel.data$total_of_special_requests,main = "Boxplot : # of Spl. Requests")
Observations * For the children column, we observe that there is one instance where there are 10 children but most of the other values are within 4. We might want to point this out to the clients to verify * Similarly, for the babies column, we observe 2 values beyond 4 * days in waiting list column has values as high as 400. This means that the waiting time has been more than 1 year. We might need to communicate this to client and decide a method to limit this number for further analysis * in the adr column, only one value lies beyond $ 5k which must be an outlier. We will remove this value as it might skew the summary statistics * For rest of the columns, we do not see anything unusual
adr column. Also, adr had some negative entries which we will remove from the datasethotel.data <- hotel.data[hotel.data$adr < max(hotel.data$adr) & hotel.data$adr > 0 ,]
boxplot(hotel.data$adr, main = "Boxplot : Avg. Daily Price")
adr columnsummary(hotel.data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.26 70.53 95.00 103.49 126.00 510.00
hotel.data.low.adr <- hotel.data[hotel.data$adr < 5,]
adr column has lowest value as 0.26. We need to discuss such values with the client. Let us store the records where adr has values less than $ 5adrhist(hotel.data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(hotel.data$adr) - 1 ,max(hotel.data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
paged_table(hotel.data)
The number of bookings will be different for the two hotel types. Further, the volume of booking and the revenue will vary across the years. Let us analyze the # of hotel bookings and revenue across years and slice the data by type of the hotel.
Total bookings across years
ggplot(data = hotel.data,aes(arrival_date_year,fill = (hotel))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Years") +
ggtitle("Hotel Type vs # of bookings across years") +
labs(fill = 'Hotel Type')
Total revenue across years split by hotel type
plot <- ggplot(hotel.data) +
stat_summary(aes(x = arrival_date_year, y = adr,fill = (hotel)),
fun.y = function(x) sum(x),
geom = "bar") +
scale_y_continuous(name = "Total Annual Revenue",labels = scales::dollar) +
xlab("Years") +
ggtitle("Hotel Type vs Total Annual Revenue across years") +
labs(fill = 'Hotel Type')
plot
2016 while the bookings seem to be less in 2015 and 2017. Anything special about 2016?Rearranging the factors to arrange months
hotel.data <- hotel.data %>%
mutate(Months = fct_relevel(arrival_date_month,"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"))
Total bookings across months
ggplot(data = hotel.data,aes(Months,fill = (hotel))) +
geom_bar(position = 'dodge') +
facet_grid(hotel.data$arrival_date_year) +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Months") +
ggtitle("Hotel Type vs # of bookings across months") +
labs(fill = 'Hotel Type')
2015 and 2017 is not for all the monthsAnother interesting analysis to look at is the number of cancellations. We need to analyze if the number cancelled is high as compared to confirmed bookings. If yes, is it specific to a hotel type? Or a particular year?
Number of bookings that were converted and cancelled for the two type of hotels
ggplot(data = hotel.data,aes(hotel,fill = (is_canceled))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Hotel Type") +
ggtitle("Hotel Type vs # of bookings across booking status") +
labs(fill = 'Booking Status')
* The ratio of bookings cancelled to the confirmed bookings seems high for
City Hotels * Is the number of cancellations high for a particular year? * Let us find out:
Number of bookings that were converted and cancelled across years
ggplot(data = hotel.data,aes(arrival_date_year,fill = (is_canceled))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Years") +
ggtitle("Booking Status vs # of bookings across years") +
labs(fill = 'Booking Status')
As we observe that the number of weeks is not same across the years. The number of weeks captured in 2015 and 2017 is not complete. Hence, we analyze the accurate way to analyze the data is scaling down to the per unit value.
paged_table(hotel.data %>%
group_by(arrival_date_year) %>%
summarize(min = min(arrival_date_week_number),
max = max(arrival_date_week_number),
weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))
Calculating the average number of bookings and average revenue generated per week
hotel.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(hotel.count = n(),
hotel.rates = sum(adr)) %>%
transmute(arrival_date_year,
avg.weekly.bookings = case_when(arrival_date_year == '2015' ~ round(hotel.count/27),
arrival_date_year == '2017' ~ round(hotel.count/35),
TRUE ~ round(hotel.count/53)),
avg.weekly.earnings = case_when(arrival_date_year == '2015' ~ round(hotel.rates/27),
arrival_date_year == '2017' ~ round(hotel.rates/35),
TRUE ~ round(hotel.rates/53)),
avg.weekly.earnings.per.booking = avg.weekly.earnings/avg.weekly.bookings)
paged_table(hotel.summary)
Plotting the Avg. Weekly Bookings
ggplot(hotel.summary,aes(x = hotel ,y = avg.weekly.bookings,color = hotel)) +
geom_point(size = 3) +
facet_grid(hotel.summary$arrival_date_year) +
ylab("Avg.numberof Weekly Bookings") +
xlab("Hotel Type") +
ggtitle("Avg.numberof Weekly Bookings vs Hotel Type across years") +
labs(color = 'Hotel Type')
* These normalized metrics give a better understanding of the data. We observe that the avg. # of bookings in a week is growing year over year * The growth is higher from
2015 to 2016 as compared to the jump from 2016 to 2017 * Let us look at the revenue:
Avg. Weekly Revenue
ggplot(hotel.summary,aes(x = hotel ,y = avg.weekly.earnings,color = hotel)) +
geom_point(size = 3) +
facet_grid(hotel.summary$arrival_date_year) +
scale_y_continuous(name = "Avg. weekly revenue",labels = scales::dollar) +
xlab("Hotel Type") +
ggtitle("Avg. weekly revenue vs Hotel Type across years") +
labs(color = 'Hotel Type')
* The growth is higher from
2015 to 2016 as compared to the jump from 2016 to 2017 * The revenue almost doubled in 2016 but it did not grow at the same rate from 2016 to 2017
Calculating the average number of new customers per week
Hotel business is also driven by the new customers coming in. This generally is a result of the marketing by hotel business or by word of mouth. Nevertheless, we need to analyze if the revenue and the volume of bookings governed by the repeated customers or the new customers.
cust.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year,is_repeated_guest) %>%
summarize(cust.count = sum(as.numeric(is_repeated_guest))) %>%
transmute(Customer.Type = ifelse(is_repeated_guest == 0,'Existing','New'),
avg.weekly.cust = case_when(arrival_date_year == '2015' ~ round(cust.count/27),
arrival_date_year == '2017' ~ round(cust.count/35),
TRUE ~ round(cust.count/53)))
paged_table(cust.summary)
New Customers and Existing Customers increases drastically in 2016 and the rise is not that high in 2017# of Customers split across customer type
ggplot(cust.summary,aes(x = Customer.Type ,y = avg.weekly.cust,color = Customer.Type)) +
geom_point(size = 3) +
facet_grid(cust.summary$arrival_date_year) +
ylab("Avg.numberof customers") +
xlab("Customer Type") +
ggtitle("Avg.numberof customers vs Customer Type across years") +
labs(color = 'Customer Type')
We can analyze the split of revenue across the type of customer
There can be significant differences in the behavior of the two type of customers. We need to dive a level deeper into the revenue generated per customer for each of these types to get a better understanding of their behaviors:
revenue.cust.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year,is_repeated_guest) %>%
summarize(hotel.rates = round(sum(adr)),
cust.count = sum(as.numeric(is_repeated_guest))) %>%
transmute(Customer.Type = ifelse(is_repeated_guest == 0,'Existing','New'),
hotel.rates,
cust.count,
avg.revenue = round(hotel.rates/cust.count))
paged_table(revenue.cust.summary)
New Customers tend to spend significantly lower than the Existing CustomersNew Customers?Existing Customers to maintain revenue flowPer customer revenue
ggplot(revenue.cust.summary,aes(x = Customer.Type ,y = avg.revenue,color = Customer.Type)) +
geom_point(size = 3) +
facet_grid(cust.summary$arrival_date_year) +
ylab("Avg.revenue per customer") +
xlab("Customer Type") +
ggtitle("Avg.revenue per customer vs Customer Type across years") +
labs(color = 'Customer Type')
Hotel industry is also driven by the Customer Experience. Imagine if you book a hotel and you are not provided the preferred room, or you are not served the type of breakfast served. Would you recommend the hotel to your family and friends? No way!
For this reason, we need to keep a track of the type of room preferred, the breakfast requirements and any other facto that might enhance customer experience. This makes analyzing the service indicators important which will help hotel to plan better for enhancing the customer experience.
We can analyze the meal preferred
meal.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(meal) %>%
summarize(meal_count = n()) %>%
arrange(-meal_count) %>%
transmute(Meal.Type=case_when(meal == 'BB' ~ 'Bed & Breakfast',
meal=='HB' ~ 'Half board (breakfast and one other meal – usually dinner)',
meal=='FB' ~ 'Full board (breakfast, lunch and dinner)',
meal=='SC'~ 'No meal package',
TRUE ~ 'Not Selected'),
meal_count)
paged_table(meal.summary)
Full Board bookings. This gives an opportunity to the hotels to increase revenue from these type of meals by offering some discountsWe can analyze the room preferred
room.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(reserved_room_type) %>%
summarize(room_count = n()) %>%
arrange(-room_count)
paged_table(room.summary)
Type A room followed by other typesType L rooms to conclude about the 4 bookings in these roomsWe can analyze the performance across weekdays and weekends across years and hotel type We know that majority of holidays call for hotels and we expect a high volume of booking on Weekends as compared to Weekdays.
Wait! Are you sure? What about the corporate bookings? The corporate meetings and get togethers happen on Weekdays. Let us see what purpose our hotel chain serves:
week.summary <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(avg.weekend.stay = round(sum(stays_in_weekend_nights)/2),
avg.weekday.stay = round(sum(stays_in_week_nights)/5))
paged_table(week.summary)
Weekend numbers by 2 and Weekday numbers by 5 to keep the same scaleWeekday bookings are higher than the Weekend numbers. That is an interesting finding. We need to investigate our corporate bookings!Analyzing if we can predict the cancellation of the booking based on the factors present in the data
Imagine if we could predict which booking will be cancelled based on past data. Won’t that bring magic to your business? We will be able to increase the waiting list based on our model.
Let us try to predict that using the type of hotel, type of customer, market segment and month of booking.
We use the rpart function which helps us build a classification tree to classify the booking status(based on the is_cancelled variable)
hotel.tree <- rpart(formula = is_canceled ~ hotel+
is_repeated_guest+market_segment+arrival_date_month,
data = hotel.data, method = "class",
parms = list(loss = matrix(c(0,1,1,0), nrow = 2)),cp = 0.001)
Visualizing the tree structure built and interpreting the results
rpart.plot(hotel.tree)
Interpretation * It is very easy to traverse the tree * Let us give it a shot. If the market segment is one of the following {Aviation, Complementary, Corporate, Direct, Offline TA/TO or Online TA} there is 33% probability of cancellation=0 (Confirmation in booking) and 83% of the population lies in this bucket * Hence this node is classified as cancellation=0 (prob<50%)
Model Assessment
Performance parameter evaluation
hotel.tree.prob.rpart = predict(hotel.tree, type = "prob")
pred = ROCR::prediction(hotel.tree.prob.rpart[,2], hotel.data$is_canceled)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
AUC.Tree.insample <- slot(performance(pred, "auc"), "y.values")[[1]]
hotel.tree.pred.tree <- predict(hotel.tree, hotel.data, type = "class")
MR.Tree <- mean(hotel.data$is_canceled != hotel.tree.pred.tree)
MR.Tree
## [1] 0.3242981
For any business to plan financially, we need to predict the revenue flow in the industry. We can use the historical data to predict the future revenue to plan for the ups and downs in the revenue and plan better. We will be forecasting the monthly revenue for the business.
Aggregating the data at month level across years. We will be using the monthly data to predict the expected adr for the following 12 months
month.data <- hotel.data %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_year,Months) %>%
summarize(hotel.monthly.rates = sum(adr))
month.data
## # A tibble: 26 x 3
## # Groups: arrival_date_year [3]
## arrival_date_year Months hotel.monthly.rates
## <int> <fct> <dbl>
## 1 2015 July 166262.
## 2 2015 August 261839.
## 3 2015 September 296950.
## 4 2015 October 256342.
## 5 2015 November 108922.
## 6 2015 December 138731.
## 7 2016 January 104424.
## 8 2016 February 175684.
## 9 2016 March 248802.
## 10 2016 April 291174.
## # ... with 16 more rows
Subset the aggregated adr variable to build the time series
hotel.monthly.adr <- as.ts(month.data[,3])
Visualizing the Time Series
plot(hotel.monthly.adr)
Using the auto.arima function to predict the AR, MA, and differencing components of our time series
model1 <- auto.arima(hotel.monthly.adr)
Storing the forecasted values and plotting them
forecast.hotel <- forecast(model1,h=12)
plot(forecast.hotel)
Through this analysis, we were able to answer key business questions which are stated below:
New Customers or Existing Customers?Methodology
Insights
2016 while the bookings seem to be less in 2015 and 2017. This is majorly due to more data points being available in 2016City Hotels2015 to 2016 as compared to the jump from 2016 to 20172015 to 2016 as compared to the jump from 2016 to 20172016 but it did not grow at the same rate from 2016 to 2017New Customers and Existing Customers increases drastically in 2016 and the rise is not that high in 2017New Customers tend to spend significantly lower than the Existing CustomersFull Board bookings. This gives an opportunity to the hotels to increase revenue from these type of meals by offering some discountsType A room followed by other typesType L rooms to conclude about the 4 bookings in these roomsWeekday bookings are higher than the Weekend numbers. That is an interesting finding. We need to investigate our corporate bookings!Implication to business
2015 to 2016 was higher as compared to the following year. We can analyze the differences across years to identify the extraordinary performance and match itLimitations
2015 and 2017 is for different months. Even though we have converted them to same base line of weekly numbers, there are chances that some weeks perform differently as compared to other weeks