Hotel Bookings

Final Project

Introduction

Hotel industry is a very volatile industry and the bookings depend on variety of factors such as type of hotels, seasonality, days of week and many more. This makes analyzing the patterns available in the past data more important to help the hotels plan better. Using the historical data, hotels can perform various campaigns to boost the business. We can use the patterns to predict the future bookings using time series or decision trees.

We will be using the data available to analyze the factors affecting the hotel bookings. These factors can be used for reporting the trends and predict the future bookings.

We will be tackling this problem statement in three stages:

We will be analyzing some key metrics for hotel bookings like:

The number of cancellations
Number of bookings on weekday vs weekends
Most preferred meal types
Country wise bookings
New customers acquired
Customer lifetime value of the existing customers
Type of rooms preferred by customers
Booking types,
Hotels available for booking
The revenue of the hotels

We will be using various lenses to look through the data to analyze patterns associated with each segment such as:

The type of hotel
Day of week
Type of customers
Type of rooms

Finally, we will also try to predict the future bookings either based on time series analysis or decision trees.

Using the results from the above analysis, business can make key decisions regarding the customer experience they desire to deliver.

Packages Required

tidyverse : This package consists of 6 core packages out of which the below 3 are most important for this project:

ggplot2: Used for creating powerful visualizations
dplyr: Used for data manipulation
tidyr: Used for data modifications

rpart, rpart.plot and ROCR : These packages are used for building classification and regression models using decision trees. Further, we can visualize the tree structure and evaluate the performance of the models
forecast, tseries and sarima : These packages are used to model the time-series data including the seasonal component in the series if any

library(tidyverse) #used for data manipulation
library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates


library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(ROCR) #used for evaluating tree performance

library("forecast") #used for time-series forecasting
library("tseries") #used for time-series forecasting
library("sarima") #used for time-series forecasting including seasonal components

Data Preparation

The data has been source from TidyTuesday website, which is a home for various amazing R projects.

The data originally comes from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019

Data Description

Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields. The data was extracted from hotels’ Property Management System (PMS) SQL databases Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.

Below is the data flow diagram used for collecting the data:

Data Importing and Cleaning

hotel.data <- read.csv("G:/UCinn Course Work/Sem2/R/Final Project/hotels/hotels.csv")

The total number of rows in data is 119390. The dataset has 32 columns

Viewing the raw data

knitr::kable(head(hotel.data,n = 10), "pandoc")

hotel	is_canceled	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_week_nights	adults	meal	country	market_segment	distribution_channel	reserved_room_type	assigned_room_type	booking_changes	deposit_type	agent	company	customer_type	adr	total_of_special_requests	reservation_status	reservation_status_date
Resort Hotel	0	342	2015	July	27	1	0	2	BB	PRT	Direct	Direct	C	C	3	No Deposit	NULL	NULL	Transient	0.0	0	Check-Out	2015-07-01
Resort Hotel	0	737	2015	July	27	1	0	2	BB	PRT	Direct	Direct	C	C	4	No Deposit	NULL	NULL	Transient	0.0	0	Check-Out	2015-07-01
Resort Hotel	0	7	2015	July	27	1	1	1	BB	GBR	Direct	Direct	A	C	0	No Deposit	NULL	NULL	Transient	75.0	0	Check-Out	2015-07-02
Resort Hotel	0	13	2015	July	27	1	1	1	BB	GBR	Corporate	Corporate	A	A	0	No Deposit	304	NULL	Transient	75.0	0	Check-Out	2015-07-02
Resort Hotel	0	14	2015	July	27	1	2	2	BB	GBR	Online TA	TA/TO	A	A	0	No Deposit	240	NULL	Transient	98.0	1	Check-Out	2015-07-03
Resort Hotel	0	14	2015	July	27	1	2	2	BB	GBR	Online TA	TA/TO	A	A	0	No Deposit	240	NULL	Transient	98.0	1	Check-Out	2015-07-03
Resort Hotel	0	0	2015	July	27	1	2	2	BB	PRT	Direct	Direct	C	C	0	No Deposit	NULL	NULL	Transient	107.0	0	Check-Out	2015-07-03
Resort Hotel	0	9	2015	July	27	1	2	2	FB	PRT	Direct	Direct	C	C	0	No Deposit	303	NULL	Transient	103.0	1	Check-Out	2015-07-03
Resort Hotel	1	85	2015	July	27	1	3	2	BB	PRT	Online TA	TA/TO	A	A	0	No Deposit	240	NULL	Transient	82.0	1	Canceled	2015-05-06
Resort Hotel	1	75	2015	July	27	1	3	2	HB	PRT	Offline TA/TO	TA/TO	D	D	0	No Deposit	15	NULL	Transient	105.5	0	Canceled	2015-04-22

Datatypes of variables

Data type of each variable with sample values:

str(hotel.data)

## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...

Based on the column definitions in data dictionary there are few candidates that might have datatype mismatch. Columns is_canceled and is_repeated_guest.
Let us check the unique values for these columns:
The unique values in hotel column are 0, 1
The unique values in is_repeated_guest column are 0, 1
Now we can convert these to factors data type

hotel.data$is_canceled <- as.factor(hotel.data$is_canceled)
hotel.data$is_repeated_guest <- as.factor(hotel.data$is_repeated_guest)

Checking the missing values in each column

Below are the missing values in each column:

colSums(is.na(hotel.data))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

hotel.data$children[is.na(hotel.data$children)] <- 0

sum(is.na(hotel.data))

## [1] 0

The only column which has missing values is children. We can impute these values by 0.
Post imputing, the number of missing values in entire dataset are 0

Data Summary

Below is the summary statistics of each variable:

summary(hotel.data)

##           hotel       is_canceled   lead_time   arrival_date_year
##  City Hotel  :79330   0:75166     Min.   :  0   Min.   :2015     
##  Resort Hotel:40060   1:44224     1st Qu.: 18   1st Qu.:2016     
##                                   Median : 69   Median :2016     
##                                   Mean   :104   Mean   :2016     
##                                   3rd Qu.:160   3rd Qu.:2017     
##                                   Max.   :737   Max.   :2017     
##                                                                  
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13877      Min.   : 1.00            Min.   : 1.0             
##  July   :12661      1st Qu.:16.00            1st Qu.: 8.0             
##  May    :11791      Median :28.00            Median :16.0             
##  October:11160      Mean   :27.17            Mean   :15.8             
##  April  :11089      3rd Qu.:38.00            3rd Qu.:23.0             
##  June   :10939      Max.   :53.00            Max.   :31.0             
##  (Other):47873                                                        
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies                 meal          country     
##  Min.   : 0.0000   Min.   : 0.000000   BB       :92310   PRT    :48590  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB       :  798   GBR    :12129  
##  Median : 0.0000   Median : 0.000000   HB       :14463   FRA    :10415  
##  Mean   : 0.1039   Mean   : 0.007949   SC       :10650   ESP    : 8568  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000   Undefined: 1169   DEU    : 7287  
##  Max.   :10.0000   Max.   :10.000000                     ITA    : 3766  
##                                                          (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      0:115580         
##  Offline TA/TO:24219   Direct   :14645      1:  3810         
##  Groups       :19811   GDS      :  193                       
##  Direct       :12606   TA/TO    :97870                       
##  Corporate    : 5295   Undefined:    5                       
##  Complementary:  743                                         
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1118     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes       deposit_type        agent      
##  A      :74053      Min.   : 0.0000   No Deposit:104641   9      :31961  
##  D      :25322      1st Qu.: 0.0000   Non Refund: 14587   NULL   :16340  
##  E      : 7806      Median : 0.0000   Refundable:   162   240    :13922  
##  F      : 3751      Mean   : 0.2211                       1      : 7191  
##  G      : 2553      3rd Qu.: 0.0000                       14     : 3640  
##  C      : 2375      Max.   :21.0000                       7      : 3539  
##  (Other): 3530                                            (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  NULL   :112593   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89613  
##  67     :   267   Mean   :  2.321      Transient-Party:25124  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date
##  Canceled :43017    2015-10-21:  1461      
##  Check-Out:75166    2015-07-06:   805      
##  No-Show  : 1207    2016-11-25:   790      
##                     2015-01-01:   763      
##                     2016-01-18:   625      
##                     2015-07-02:   469      
##                     (Other)   :114477

Observations:

There are 2 type of hotels. City Hotel has 79330 number of records and Resort Hotel has 40060 number of records
The number of bookings is 44224
Aug has maximum number of bookings
Second half of year generally observes more booking as the median value of the week is greater than 26
Weekdays observe longer stays as compared to weekends
The min value in number of adult’s column is 0 and we need to inspect such records
There are few records in distribution channel are mapped to Undefined channel
Type A rooms are preferred in majority of the bookings
There are few records in agent and company column are text stored as Null
adr column has negative values which must be investigated
One additional check we can do is to check the difference between arrival date and reservation_status_date

Visual Summary

Histogram of Yearly Bookings

hist(hotel.data$arrival_date_year, prob = T, col = "yellow", breaks = 10, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")

Majority bookings can be seen in 2016

Histogram of Avg. Daily Price

hist(hotel.data$adr, prob = T, col = "grey", breaks = 100,xlim = c(min(hotel.data$adr) - 1,max(hotel.data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")

There seem some outliers in the adr column

Outliers

Below are the plots of various numeric variables:

boxplot(hotel.data$stays_in_weekend_nights,main = "Boxplot : Weekend Nights")

boxplot(hotel.data$stays_in_week_nights,main = "Boxplot : Weekday Nights")

boxplot(hotel.data$adults,main = "Boxplot : # of Adults")

boxplot(hotel.data$children,main = "Boxplot : # of Children")

boxplot(hotel.data$babies,main = "Boxplot : # of Babies")

boxplot(hotel.data$previous_cancellations,main = "Boxplot : # of Previous Cancellations")

boxplot(hotel.data$previous_bookings_not_canceled,main = "Boxplot : # of Previous Non-Cancellations")

boxplot(hotel.data$booking_changes,main = "Boxplot : # of Booking Changes")

boxplot(hotel.data$days_in_waiting_list,main = "Boxplot : Days in Waiting List")

boxplot(hotel.data$adr,main = "Boxplot : Avg. Daily Price")

boxplot(hotel.data$required_car_parking_spaces,main = "Boxplot : # of Car Parking Spaces Req.")

boxplot(hotel.data$total_of_special_requests,main = "Boxplot : # of Spl. Requests")

Observations * For the children column, we observe that there is one instance where there are 10 children but most of the other values are within 4. We might want to point this out to the clients to verify * Similarly, for the babies column, we observe 2 values beyond 4 * days in waiting list column has values as high as 400. This means that the waiting time has been more than 1 year. We might need to communicate this to client and decide a method to limit this number for further analysis * in the adr column, only one value lies beyond $ 5k which must be an outlier. We will remove this value as it might skew the summary statistics * For rest of the columns, we do not see anything unusual

Removing Outliers

Removing one outlier value in adr column. Also, adr had some negative entries which we will remove from the dataset
Post cleaning the data, below is the boxplot:

hotel.data <- hotel.data[hotel.data$adr < max(hotel.data$adr) & hotel.data$adr > 0 ,]

boxplot(hotel.data$adr, main = "Boxplot : Avg. Daily Price")

Let us check the summary of adr column

summary(hotel.data$adr)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.26   70.53   95.00  103.49  126.00  510.00

hotel.data.low.adr <- hotel.data[hotel.data$adr < 5,]

We still observe that the adr column has lowest value as 0.26. We need to discuss such values with the client. Let us store the records where adr has values less than $ 5
There are 65 such values
Let us also check the distribution of adr

hist(hotel.data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(hotel.data$adr) - 1 ,max(hotel.data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")

Finally, let us take a glimpse of the cleaned data

paged_table(hotel.data)

Exploratory Data Analysis

Annual Metrics

The number of bookings will be different for the two hotel types. Further, the volume of booking and the revenue will vary across the years. Let us analyze the # of hotel bookings and revenue across years and slice the data by type of the hotel.

Total bookings across years

ggplot(data = hotel.data,aes(arrival_date_year,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Years") +
  ggtitle("Hotel Type vs # of bookings across years") +
  labs(fill = 'Hotel Type')

Total revenue across years split by hotel type

plot <- ggplot(hotel.data) + 
  stat_summary(aes(x = arrival_date_year, y = adr,fill = (hotel)), 
               fun.y = function(x) sum(x), 
               geom = "bar") +
  scale_y_continuous(name = "Total Annual Revenue",labels = scales::dollar) +
  xlab("Years") +
  ggtitle("Hotel Type vs Total Annual Revenue across years") +
  labs(fill = 'Hotel Type')


plot

The number of bookings seems to be high in 2016 while the bookings seem to be less in 2015 and 2017. Anything special about 2016?
Well we have the answer in the following analysis little further!
Is this trend governed by specific months or is it uniform growth across months?
Let us find out:

Rearranging the factors to arrange months

hotel.data <- hotel.data %>% 
  mutate(Months = fct_relevel(arrival_date_month,"January",
                              "February",
                              "March",
                              "April",
                              "May",
                              "June",
                              "July",
                              "August",
                              "September",
                              "October",
                              "November",
                              "December"))

Total bookings across months

ggplot(data = hotel.data,aes(Months,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  facet_grid(hotel.data$arrival_date_year) +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Months") +
  ggtitle("Hotel Type vs # of bookings across months") +
  labs(fill = 'Hotel Type')

Here we go! We can see that the data for 2015 and 2017 is not for all the months
This indicates that all the annual numbers need to be normalized to a common baseline
In the further analysis, we will be converting the annual numbers to weekly based on the # of weeks reported in the data

Metrics by Hotel Type

Another interesting analysis to look at is the number of cancellations. We need to analyze if the number cancelled is high as compared to confirmed bookings. If yes, is it specific to a hotel type? Or a particular year?

Number of bookings that were converted and cancelled for the two type of hotels

ggplot(data = hotel.data,aes(hotel,fill = (is_canceled))) +
  geom_bar(position = 'dodge') +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Hotel Type") +
  ggtitle("Hotel Type vs # of bookings across booking status") +
  labs(fill = 'Booking Status')

* The ratio of bookings cancelled to the confirmed bookings seems high for City Hotels * Is the number of cancellations high for a particular year? * Let us find out:

Number of bookings that were converted and cancelled across years

ggplot(data = hotel.data,aes(arrival_date_year,fill = (is_canceled))) +
  geom_bar(position = 'dodge') +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Years") +
  ggtitle("Booking Status vs # of bookings across years") +
  labs(fill = 'Booking Status')

Rationalized Metrics

As we observe that the number of weeks is not same across the years. The number of weeks captured in 2015 and 2017 is not complete. Hence, we analyze the accurate way to analyze the data is scaling down to the per unit value.

paged_table(hotel.data %>% 
  group_by(arrival_date_year) %>%
  summarize(min = min(arrival_date_week_number),
            max = max(arrival_date_week_number),
            weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))

Calculating the average number of bookings and average revenue generated per week

hotel.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(hotel,arrival_date_year) %>%
  summarize(hotel.count = n(),
            hotel.rates = sum(adr)) %>%
  transmute(arrival_date_year,
            avg.weekly.bookings = case_when(arrival_date_year == '2015' ~ round(hotel.count/27),
                                arrival_date_year == '2017' ~ round(hotel.count/35),
                                TRUE ~ round(hotel.count/53)),
         avg.weekly.earnings = case_when(arrival_date_year == '2015' ~ round(hotel.rates/27),
                                arrival_date_year == '2017' ~ round(hotel.rates/35),
                                TRUE ~ round(hotel.rates/53)),
         avg.weekly.earnings.per.booking = avg.weekly.earnings/avg.weekly.bookings)
paged_table(hotel.summary)

Plotting the Avg. Weekly Bookings

ggplot(hotel.summary,aes(x = hotel ,y = avg.weekly.bookings,color = hotel)) +
  geom_point(size = 3) +
  facet_grid(hotel.summary$arrival_date_year) +
  ylab("Avg.numberof Weekly Bookings") +
  xlab("Hotel Type") +
  ggtitle("Avg.numberof Weekly Bookings vs Hotel Type across years") +
  labs(color = 'Hotel Type')

* These normalized metrics give a better understanding of the data. We observe that the avg. # of bookings in a week is growing year over year * The growth is higher from 2015 to 2016 as compared to the jump from 2016 to 2017 * Let us look at the revenue:

Avg. Weekly Revenue

ggplot(hotel.summary,aes(x = hotel ,y = avg.weekly.earnings,color = hotel)) +
  geom_point(size = 3) +
  facet_grid(hotel.summary$arrival_date_year) +
  scale_y_continuous(name = "Avg. weekly revenue",labels = scales::dollar) +
  xlab("Hotel Type") +
  ggtitle("Avg. weekly revenue vs Hotel Type across years") +
  labs(color = 'Hotel Type')

* The growth is higher from 2015 to 2016 as compared to the jump from 2016 to 2017 * The revenue almost doubled in 2016 but it did not grow at the same rate from 2016 to 2017

Calculating the average number of new customers per week

Hotel business is also driven by the new customers coming in. This generally is a result of the marketing by hotel business or by word of mouth. Nevertheless, we need to analyze if the revenue and the volume of bookings governed by the repeated customers or the new customers.

cust.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(arrival_date_year,is_repeated_guest) %>%
  summarize(cust.count = sum(as.numeric(is_repeated_guest))) %>%
  transmute(Customer.Type = ifelse(is_repeated_guest == 0,'Existing','New'),
            avg.weekly.cust = case_when(arrival_date_year == '2015' ~ round(cust.count/27),
                                arrival_date_year == '2017' ~ round(cust.count/35),
                                TRUE ~ round(cust.count/53)))
paged_table(cust.summary)

We observe that the number of New Customers and Existing Customers increases drastically in 2016 and the rise is not that high in 2017
Let us visualize this difference:

# of Customers split across customer type

ggplot(cust.summary,aes(x = Customer.Type ,y = avg.weekly.cust,color = Customer.Type)) +
  geom_point(size = 3) +
  facet_grid(cust.summary$arrival_date_year) +
  ylab("Avg.numberof customers") +
  xlab("Customer Type") +
  ggtitle("Avg.numberof customers vs Customer Type across years") +
  labs(color = 'Customer Type')

We can analyze the split of revenue across the type of customer

There can be significant differences in the behavior of the two type of customers. We need to dive a level deeper into the revenue generated per customer for each of these types to get a better understanding of their behaviors:

revenue.cust.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(arrival_date_year,is_repeated_guest) %>%
  summarize(hotel.rates = round(sum(adr)),
            cust.count = sum(as.numeric(is_repeated_guest))) %>% 
  transmute(Customer.Type = ifelse(is_repeated_guest == 0,'Existing','New'),
            hotel.rates,
            cust.count,
            avg.revenue = round(hotel.rates/cust.count))


paged_table(revenue.cust.summary)

We observe that the New Customers tend to spend significantly lower than the Existing Customers
This opens a question to the business:
- How to increase the spend of New Customers?
- Need to maintain good relations with the Existing Customers to maintain revenue flow

Per customer revenue

ggplot(revenue.cust.summary,aes(x = Customer.Type ,y = avg.revenue,color = Customer.Type)) +
  geom_point(size = 3) +
  facet_grid(cust.summary$arrival_date_year) +
  ylab("Avg.revenue per customer") +
  xlab("Customer Type") +
  ggtitle("Avg.revenue per customer vs Customer Type across years") +
  labs(color = 'Customer Type')

Improving Customer Experience

Hotel industry is also driven by the Customer Experience. Imagine if you book a hotel and you are not provided the preferred room, or you are not served the type of breakfast served. Would you recommend the hotel to your family and friends? No way!

For this reason, we need to keep a track of the type of room preferred, the breakfast requirements and any other facto that might enhance customer experience. This makes analyzing the service indicators important which will help hotel to plan better for enhancing the customer experience.

We can analyze the meal preferred

meal.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(meal) %>%
  summarize(meal_count = n()) %>% 
  arrange(-meal_count) %>%  
  transmute(Meal.Type=case_when(meal == 'BB' ~ 'Bed & Breakfast',
                                   meal=='HB' ~ 'Half board (breakfast and one other meal – usually dinner)',
                                   meal=='FB' ~ 'Full board (breakfast, lunch and dinner)',
                             meal=='SC'~ 'No meal package',      
                             TRUE ~ 'Not Selected'),
            meal_count)

paged_table(meal.summary)

Clearly, majority of bookings are accompanied with a breakfast or dinner
Another key insight here is that there are few Full Board bookings. This gives an opportunity to the hotels to increase revenue from these type of meals by offering some discounts

We can analyze the room preferred

room.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(reserved_room_type) %>%
  summarize(room_count = n()) %>% 
  arrange(-room_count)

paged_table(room.summary)

The above table suggests that the hotels need to have more of Type A room followed by other types
Additionally, we can investigate the Type L rooms to conclude about the 4 bookings in these rooms

We can analyze the performance across weekdays and weekends across years and hotel type We know that majority of holidays call for hotels and we expect a high volume of booking on Weekends as compared to Weekdays.

Wait! Are you sure? What about the corporate bookings? The corporate meetings and get togethers happen on Weekdays. Let us see what purpose our hotel chain serves:

week.summary <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(hotel,arrival_date_year) %>%
  summarize(avg.weekend.stay = round(sum(stays_in_weekend_nights)/2),
            avg.weekday.stay = round(sum(stays_in_week_nights)/5))

paged_table(week.summary)

Here, we have divided the Weekend numbers by 2 and Weekday numbers by 5 to keep the same scale
We observe that Weekday bookings are higher than the Weekend numbers. That is an interesting finding. We need to investigate our corporate bookings!

Machine Learning

Trees based classification

Analyzing if we can predict the cancellation of the booking based on the factors present in the data

Imagine if we could predict which booking will be cancelled based on past data. Won’t that bring magic to your business? We will be able to increase the waiting list based on our model.

Let us try to predict that using the type of hotel, type of customer, market segment and month of booking.

We use the rpart function which helps us build a classification tree to classify the booking status(based on the is_cancelled variable)

hotel.tree <- rpart(formula = is_canceled ~ hotel+
                              is_repeated_guest+market_segment+arrival_date_month,
                            data = hotel.data, method = "class", 
                            parms = list(loss = matrix(c(0,1,1,0), nrow = 2)),cp = 0.001)

Visualizing the tree structure built and interpreting the results

rpart.plot(hotel.tree)

Interpretation * It is very easy to traverse the tree * Let us give it a shot. If the market segment is one of the following {Aviation, Complementary, Corporate, Direct, Offline TA/TO or Online TA} there is 33% probability of cancellation=0 (Confirmation in booking) and 83% of the population lies in this bucket * Hence this node is classified as cancellation=0 (prob<50%)

Model Assessment

Performance parameter evaluation

hotel.tree.prob.rpart = predict(hotel.tree, type = "prob")
pred = ROCR::prediction(hotel.tree.prob.rpart[,2], hotel.data$is_canceled)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

AUC.Tree.insample <- slot(performance(pred, "auc"), "y.values")[[1]]

hotel.tree.pred.tree <- predict(hotel.tree, hotel.data, type = "class")
MR.Tree <- mean(hotel.data$is_canceled != hotel.tree.pred.tree)
MR.Tree

## [1] 0.3242981

The AUC curve describes the model accuracy. The higher the area under the curve the better is the model
The misclassification rate of the model is 32% which suggests that the model spills out wrong results 32% of the time

Time Series Forecasting

For any business to plan financially, we need to predict the revenue flow in the industry. We can use the historical data to predict the future revenue to plan for the ups and downs in the revenue and plan better. We will be forecasting the monthly revenue for the business.

Aggregating the data at month level across years. We will be using the monthly data to predict the expected adr for the following 12 months

month.data <- hotel.data %>%
  filter(is_canceled == 0) %>% 
  group_by(arrival_date_year,Months) %>%
  summarize(hotel.monthly.rates = sum(adr))

month.data

## # A tibble: 26 x 3
## # Groups:   arrival_date_year [3]
##    arrival_date_year Months    hotel.monthly.rates
##                <int> <fct>                   <dbl>
##  1              2015 July                  166262.
##  2              2015 August                261839.
##  3              2015 September             296950.
##  4              2015 October               256342.
##  5              2015 November              108922.
##  6              2015 December              138731.
##  7              2016 January               104424.
##  8              2016 February              175684.
##  9              2016 March                 248802.
## 10              2016 April                 291174.
## # ... with 16 more rows

Subset the aggregated adr variable to build the time series

hotel.monthly.adr <- as.ts(month.data[,3])

Visualizing the Time Series

plot(hotel.monthly.adr)

Using the auto.arima function to predict the AR, MA, and differencing components of our time series

model1 <- auto.arima(hotel.monthly.adr)

Storing the forecasted values and plotting them

forecast.hotel <- forecast(model1,h=12)

plot(forecast.hotel)

Summary

Through this analysis, we were able to answer key business questions which are stated below:

Is there a difference in volume of bookings and revenue across the two type of hotels?
Is the change driven by a time period such as a year or a month?
Is the cancellation rate higher than the booking rate?
Are the cancellations driven by any factors such as time period or hotel type?
What is the average revenue generated per booking?
Is the revenue and booking driven by specific customer type such as New Customers or Existing Customers?
Are there more new customers coming in as years progress? How is the growth rate?
What type of meal is preferred?
What type of rooms are preferred?
Are the bookings driven by the week of day?
For a new booking, based on certain attributes, can we predict if the booking will go through or get cancelled?
What is the revenue expected in the coming 12 months? When do I need to run promotion?

Methodology

Majority of the analysis was EDA which was digging one level deeper and getting the data to answer the above questions
To answer few questions I sliced the data across various cuts
Another methodology included using different lenses to view data across segments
I used classification tree to predict if the booking will go through
For estimating the revenue over the next 12 months, I used time-series modelling

Insights

The number of bookings seems to be high in 2016 while the bookings seem to be less in 2015 and 2017. This is majorly due to more data points being available in 2016
The ratio of bookings cancelled to the confirmed bookings seems high for City Hotels
We observe that the avg. # of bookings in a week is growing year over year
The growth is higher from 2015 to 2016 as compared to the jump from 2016 to 2017
The growth is higher from 2015 to 2016 as compared to the jump from 2016 to 2017
The revenue almost doubled in 2016 but it did not grow at the same rate from 2016 to 2017
We observe that the number of New Customers and Existing Customers increases drastically in 2016 and the rise is not that high in 2017
We observe that the New Customers tend to spend significantly lower than the Existing Customers
Majority of bookings are accompanied with a breakfast or dinner
There are few Full Board bookings. This gives an opportunity to the hotels to increase revenue from these type of meals by offering some discounts
The hotels need to have more of Type A room followed by other types
We can investigate the Type L rooms to conclude about the 4 bookings in these rooms
We observe that Weekday bookings are higher than the Weekend numbers. That is an interesting finding. We need to investigate our corporate bookings!

Implication to business

The above analysis has unraveled some key findings in terms of the hotel type, customer type, customer preferences and time preference
Based on the results of EDA, hotel can plan on targeting the new customers to increase spends and maintain good relations with the existing customers
The hotels have an insight on the meal preference and room preference which will help them price the commodities better
Some metrics depict that the growth from 2015 to 2016 was higher as compared to the following year. We can analyze the differences across years to identify the extraordinary performance and match it
Based on the classification tree, hotels can predict if the booking will convert or not. This will allow them to make a waiting list based on the chance of cancellations to make the complete use of the capacity
The revenue forecast will help them plan promotions during the low revenue period

Limitations

The data for 2015 and 2017 is for different months. Even though we have converted them to same base line of weekly numbers, there are chances that some weeks perform differently as compared to other weeks
The definition of new customers is not very well described. A new customer this year will be existing next year, or they can be existing customer from the 2nd booking. A deeper analysis in required based on definition
The weekday vs weekend analysis can be further drilled for the type of bookings
The classification model uses few variables. We can tune the model with new variables and adjusting the cost of misclassification. Additionally, we need to split the data into train and validation to avoid overfitting
The forecasting aspect can be further drilled to analyze the residuals and split the model across years and various factors such as customer type or hotel type