Authors: Pallabi Das | Trishita Aditya | Parul Ranjan

Introduction

Introduction

The hospitality industry is competitive, and businesses need to keep up with the latest hospitality trends to avoid being left behind. However, the needs and priorities of customers have also changed due to COVID. Keeping pace with the industry is a great way to ensure your business delivers the kind of customer experience people want and expect. This includes everything from marketing efforts and hospitality processes to embracing the latest technology and responding to global events. And one of the best ways to do so is by analyzing the real-time hotel booking data set to gain insights about the change in customer booking trends and based on those businesses can come up with the best strategies to deliver the best customer experience.

Problem Statement

With our analysis we try to provide insights into effective revenue management by predicting with is the busiest time for the hotel industry so that they can best prepare their services and secondly we will also try to find out which type of customers add more to the revenue.

The data set which we have chosen to do analysis contains hotel demand data of two hotels in Portugal: Resort Hotel of Algarve and City Hotel of Lisbon. This data set has 31 variables describing 40,060 observations of Resort hotel and 79,330 observations of City Hotel.

Following are the problem statements that we are going to answer with our analysis:

  1. To visualize which year, which month and among weekday / weekend nights have the maximum booking , to understand what is the best time to generate revenue flow in the business.
  2. To understand which type of customer, for e.g., new customers or existing customers, customers belonging to certain age groups or customers having family, etc., add more revenue to the business.
  3. To understand what are the factors leading to cancellations of booking affecting the revenue generation of the hotels.

Analytic Technique

We start with data cleaning, like removing duplicate values, identifying null values, and then imputing them or removing them as per requirement, and removing trivial data if found. We will also do quality assurance to convert our raw data into appropriate formats. We do all this cleaning so that any insignificant data does not produce any irrelevant result leading to wrong analysis. Our second step is going to be exploratory data analysis – summary statistics, finding outliers, correlations between variables, transforming our data analysis into data visualizations which will help us to summarize the data so we can easily spot trends, patterns, and relationships in the data set, etc. Lastly, we will create our model to predict the busiest time for the booking, the customer groups generating more revenue and factors leading to cancellations.

Our agenda is to determine the busiest time for the bookings in both the hotels, find the customer type leading to more revenue generation and the booking cancellation reasons. This helps in determining the factors which can be emphasized by the hotels to provide the best customer experience. We need to find the dependency of variables on each other and variables affecting our problem statements, in the data set. Once we have determined the variables, we can apply any algorithm like a decision tree to predict the busiest time or customer type by training our model with the current data. Additionally, we will also try to find out the revenue generation of each hotel and the factors which drive it.

Consumer Benefit

With our analysis we will be able to find out the busiest time for hotel bookings, this will help the hotel industry to provide the best offers on bookings, also help in keeping the staff ready and well equipped to handle the high load so that the hotel operation runs smoothly at this time, providing the best customer experience. This will lead to customers’ good feedback and increase the brand value. Also finding which type of customer does a greater number of bookings, will help provide them with the best offers and services to enhance their experience at the hotel making them loyal to the hotel. All these factors will improve revenue generation with the right marketing strategies. By knowing the reasons for cancellation, hotels can emphasize on the factors and improve their business so that it leads to decrease in booking cancellations and increased revenue generation.

Package Requirements

Packages Used

We will be using the following packages in our code:

options(warn = -1)
library(tidyverse) #provide key data transformation functions in a single package.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr) #For report generation
library(kableExtra) #For building tables
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(dplyr) #For data manipulation
library(corrplot) #For checking the correlation between different attributes
## corrplot 0.92 loaded
library(ggplot2) #For data visualization
library(psych) #Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(sqldf) #sqldf() transparently sets up a database, imports the data frames into that database, performs the SQL select or other statement and returns the result using a heuristic to determine which class to assign to each column of the returned data frame.
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(corrplot) #provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
library(DT) #provides an R interface to the JavaScript library DataTables. R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables.
library(rmarkdown) ##used for formatting the markdown file

Data Preparation

Data Source

The data comes from an open hotel booking demand data set from Antonio, Almeida and Nunes, 2019 available at Github.

Data Description

This data set describes hotel demand data. One of the hotels is a resort hotel and the other is a city hotel. The data set contains 31 variables describing the 40,060 observations of resort hotel and 79,330 observations of city hotel. Each observation represents a hotel booking. the data set comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.

Original Purpose: The data set was made aiming at the development of prediction models to classify a hotel booking׳s likelihood to be canceled. Nevertheless, due to the characteristics of the variables included in the data set, their use goes beyond this cancellation prediction problem.

Data set History: In order to prevent leakage of future information, the timestamp of the target variable must occur after the input variables’ timestamp. Thus, instead of directly extracting variables from the bookings database table, when available, the variables’ values were extracted from the bookings change log, with a timestamp relative to the day prior to arrival date (for all the bookings created before their arrival date).

Not all variables in this dataset come from the bookings or change log database tables. Some come from other tables, and some are engineered from different variables from different tables. A diagram presenting the PMS database tables from where variables were extracted is presented below.

Data was obtained directly from the hotels’ PMS databases’ servers by executing a TSQL query on SQL Server Studio Manager. This query first collected the value or ID (in the case of foreign keys) of each variable in the BO table. The BL table was then checked for any alteration with respect to the day prior to the arrival. If an alteration was found, the value used was the one present in the BL table. For all the variables holding values in related tables (like meals, distribution channels, nationalities or market segments), their related values were retrieved.

The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.

Data Import and Cleaning

Data Import

We have read the csv file and found out that there are 119390 rows and 32 columns in the data set.

options(warn = -1)
hotel_data <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#retrieving the full column specification for this data.
spec(hotel_data)
## cols(
##   hotel = col_character(),
##   is_canceled = col_double(),
##   lead_time = col_double(),
##   arrival_date_year = col_double(),
##   arrival_date_month = col_character(),
##   arrival_date_week_number = col_double(),
##   arrival_date_day_of_month = col_double(),
##   stays_in_weekend_nights = col_double(),
##   stays_in_week_nights = col_double(),
##   adults = col_double(),
##   children = col_double(),
##   babies = col_double(),
##   meal = col_character(),
##   country = col_character(),
##   market_segment = col_character(),
##   distribution_channel = col_character(),
##   is_repeated_guest = col_double(),
##   previous_cancellations = col_double(),
##   previous_bookings_not_canceled = col_double(),
##   reserved_room_type = col_character(),
##   assigned_room_type = col_character(),
##   booking_changes = col_double(),
##   deposit_type = col_character(),
##   agent = col_character(),
##   company = col_character(),
##   days_in_waiting_list = col_double(),
##   customer_type = col_character(),
##   adr = col_double(),
##   required_car_parking_spaces = col_double(),
##   total_of_special_requests = col_double(),
##   reservation_status = col_character(),
##   reservation_status_date = col_date(format = "")
## )
Data Cleaning

Checking for duplicate records: Sometimes, we have duplicate records present in the data set which causes sampling bias and hence, we remove any duplicate records present in the data set.

#removing any duplicate records in the dataset
hotel_data <- unique(hotel_data)

The total unique observations has become 87396 which shows that 26.8% of the records were duplicate which were dropped off.

Checking for missing records records:

We check for missing records as it might generate wrong analysis results. Based on the situation we either remove it or impute it.

#Checking for total no of missing values per column
colSums(is.na(hotel_data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

We found out that there are 4 missing records for children column. We will impute the missing records with the median value of the column.

#Imputing the median for missing values for children column
hotel_data$children[is.na(hotel_data$children)] <- median(hotel_data$children,na.rm = T)

Checking for string inconsistencies:

We now check our data set for any structural errors, so that the data is consistant throughout the data set.

#checking for String inconsistencies
unique(hotel_data$hotel)               
## [1] "Resort Hotel" "City Hotel"
unique(hotel_data$arrival_date_month)  
##  [1] "July"      "August"    "September" "October"   "November"  "December" 
##  [7] "January"   "February"  "March"     "April"     "May"       "June"
unique(hotel_data$meal)                
## [1] "BB"        "FB"        "HB"        "SC"        "Undefined"
unique(hotel_data$market_segment)      
## [1] "Direct"        "Corporate"     "Online TA"     "Offline TA/TO"
## [5] "Complementary" "Groups"        "Undefined"     "Aviation"
unique(hotel_data$distribution_channel)
## [1] "Direct"    "Corporate" "TA/TO"     "Undefined" "GDS"
unique(hotel_data$reserved_room_type)  
##  [1] "C" "A" "D" "E" "G" "F" "H" "L" "P" "B"
unique(hotel_data$assigned_room_type)  
##  [1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "P" "L" "K"
unique(hotel_data$deposit_type)        
## [1] "No Deposit" "Refundable" "Non Refund"
unique(hotel_data$customer_type)       
## [1] "Transient"       "Contract"        "Transient-Party" "Group"
unique(hotel_data$reservation_status)
## [1] "Check-Out" "Canceled"  "No-Show"

We find that in meal column both undefined and SC means no meal package, so we replace undefined with SC

#replacing undefined with SC
hotel_data$meal[hotel_data$meal == "Undefined"] <- "SC"
unique(hotel_data$meal)
## [1] "BB" "FB" "HB" "SC"

Checking for negative values:

We are checking for negative values in our data set to ensure the correctness of the data.

#checking for negative values
nrow(subset(hotel_data, hotel_data$is_canceled < 0))                   
## [1] 0
nrow(subset(hotel_data, hotel_data$lead_time < 0))                     
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_year < 0))             
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_week_number < 0))      
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_day_of_month < 0))     
## [1] 0
nrow(subset(hotel_data, hotel_data$stays_in_weekend_nights < 0))       
## [1] 0
nrow(subset(hotel_data, hotel_data$stays_in_week_nights < 0))          
## [1] 0
nrow(subset(hotel_data, hotel_data$adults < 0))                        
## [1] 0
nrow(subset(hotel_data, hotel_data$children < 0))                      
## [1] 0
nrow(subset(hotel_data, hotel_data$babies < 0))                        
## [1] 0
nrow(subset(hotel_data, hotel_data$is_repeated_guest < 0))             
## [1] 0
nrow(subset(hotel_data, hotel_data$previous_cancellations < 0))       
## [1] 0
nrow(subset(hotel_data, hotel_data$previous_bookings_not_canceled < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$booking_changes < 0))               
## [1] 0
nrow(subset(hotel_data, hotel_data$days_in_waiting_list < 0))         
## [1] 0
nrow(subset(hotel_data, hotel_data$adr < 0))                           
## [1] 1
nrow(subset(hotel_data, hotel_data$required_car_parking_spaces < 0))   
## [1] 0
nrow(subset(hotel_data, hotel_data$total_of_special_requests < 0))   
## [1] 0

We can see that there is only 1 row having negative adr which makes no sense because if a room is booked, it will have some positive or zero (in case of No Deposit) average daily rate. We can drop off these records for our further analysis.

#We find that adr has one negative value, so we remove that row
hotel_data <- hotel_data %>% filter(hotel_data$adr > 0 | hotel_data$adr == 0)

Checking for irrelevant bookings We will check for booking with no adults, children and babies.

# Finding number of records when no adult, children or babies checked in
nrow(subset(hotel_data, hotel_data$adults == 0 & hotel_data$children == 0 & hotel_data$babies == 0))
## [1] 166

There 166 records showing with no person being checked in. So we will drop off these records.

#Removing 166 records where the booking is under 0 adult and 0 children and 0 babies
hotel_data <- hotel_data %>% filter(hotel_data$adults != 0 | hotel_data$children != 0 | hotel_data$babies != 0)

We will also check for booking under just babies because this seems impractical. Children of age 17 can still do bookings without being accompanied by adults.

# Finding number of records with only babies checked in
nrow(subset(hotel_data, hotel_data$adults == 0 & hotel_data$children == 0 & hotel_data$babies > 0))
## [1] 0

There are 0 records of just babies being checked in.

Variable datatypes cleaning

We have observed from the structure that all the data types for the variables are appropriate and no further changes are needed.

Checking for Null values in the data set

#checking for null value for company
nrow(subset(hotel_data, hotel_data$company == 'NULL'))
## [1] 81992
#checking for null value for agent
nrow(subset(hotel_data, hotel_data$agent == 'NULL'))
## [1] 12141

There are 81992 and 12141 null values for Company and Agent column respectively. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.

Therefore, we will replace the NULL values with Booking not done by company and Booking not done by Agent for Company and Agent column respectively.

#replacing null values in company
hotel_data$company[hotel_data$company == "NULL"] <- "Booking not done by company"

#replacing null values in agent
hotel_data$agent[hotel_data$agent == "NULL"] <- "Booking not done by agent"

Checking for Outliers:

Finally, we will find out the outliers present in the data. These outliers may be due to incorrect data or due to the extreme values. We are going to find out the outliers of each numeric column to determine whether to keep or remove those outliers.

We have used boxplots to determine the outliers of each numeric column.

# Creating boxplot of all  numeric variables

boxplot(hotel_data$lead_time, ylab = "lead_time",main = "Boxplot of lead_time")

boxplot(hotel_data$arrival_date_week_number, ylab = "arrival_date_week_number",main = "Boxplot of arrival_date_week_number")

boxplot(hotel_data$arrival_date_year, ylab = "arrival_date_year",main = "Boxplot of arrival_date_year")

boxplot(hotel_data$arrival_date_day_of_month, ylab = "arrival_date_day_of_month",main = "Boxplot of arrival_date_day_of_month")

boxplot(hotel_data$stays_in_weekend_nights, ylab = "stays_in_weekend_nights",main = "Boxplot of stays_in_weekend_nights")

boxplot(hotel_data$stays_in_week_nights, ylab = "stays_in_week_nights",main = "Boxplot of stays_in_week_nights")

boxplot(hotel_data$adults, ylab = "adults",main = "Boxplot of adults")

boxplot(hotel_data$children, ylab = "children",main = "Boxplot of children")

boxplot(hotel_data$babies, ylab = "babies",main = "Boxplot of babies")

boxplot(hotel_data$previous_cancellations, ylab = "previous_cancellations",main = "Boxplot of previous_cancellations")

boxplot(hotel_data$previous_bookings_not_canceled, ylab = "previous_bookings_not_canceled",main = "Boxplot of previous_bookings_not_canceled")

boxplot(hotel_data$booking_changes, ylab = "booking_changes",main = "Boxplot of booking_changes")

boxplot(hotel_data$days_in_waiting_list, ylab = "days_in_waiting_list",main = "Boxplot of days_in_waiting_list")

boxplot(hotel_data$adr, ylab = "adr",main = "Boxplot of Average_Daily_Rate")

boxplot(hotel_data$required_car_parking_spaces, ylab = "required_car_parking_spaces",main = "Boxplot of required_car_parking_spaces")

boxplot(hotel_data$total_of_special_requests, ylab = "total_of_special_requests",main = "Boxplot of total_of_special_requests")

From the graphs above, we can visualize that there are outliers present for all numeric columns except arrival_date_week_number and arrival_date_day_of_month column. These outliers seems to be something originating from extreme values and not due to error. Thus, we will keep them in our dataset.

Looking at the structure of our data set after cleaning

After doing all the above steps for data cleaning we will now look at the structure of our cleaned data set.

#examining the structure of the data set
str(hotel_data)
## tibble [87,229 x 32] (S3: tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:87229] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:87229] 0 0 0 0 0 0 0 1 1 1 ...
##  $ lead_time                     : num [1:87229] 342 737 7 13 14 0 9 85 75 23 ...
##  $ arrival_date_year             : num [1:87229] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:87229] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:87229] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:87229] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:87229] 0 0 1 1 2 2 2 3 3 4 ...
##  $ adults                        : num [1:87229] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:87229] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:87229] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:87229] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:87229] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:87229] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:87229] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:87229] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:87229] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:87229] "Booking not done by agent" "Booking not done by agent" "Booking not done by agent" "304" ...
##  $ company                       : chr [1:87229] "Booking not done by company" "Booking not done by company" "Booking not done by company" "Booking not done by company" ...
##  $ days_in_waiting_list          : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:87229] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:87229] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:87229] 0 0 0 0 1 0 1 1 0 0 ...
##  $ reservation_status            : chr [1:87229] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:87229], format: "2015-07-01" "2015-07-01" ...

Final Data Set

We now have a clean data set with 87,229 observations and 32 variables with which we can proceed for our data analysis. Let’s have a look at our data set in the condensed form:

kable(head(hotel_data, 32), booktabs = TRUE) %>%
  kable_styling(font_size = 12) 
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Resort Hotel 0 342 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit Booking not done by agent Booking not done by company 0 Transient 0.00 0 0 Check-Out 2015-07-01
Resort Hotel 0 737 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit Booking not done by agent Booking not done by company 0 Transient 0.00 0 0 Check-Out 2015-07-01
Resort Hotel 0 7 2015 July 27 1 0 1 1 0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit Booking not done by agent Booking not done by company 0 Transient 75.00 0 0 Check-Out 2015-07-02
Resort Hotel 0 13 2015 July 27 1 0 1 1 0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304 Booking not done by company 0 Transient 75.00 0 0 Check-Out 2015-07-02
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 Booking not done by company 0 Transient 98.00 0 1 Check-Out 2015-07-03
Resort Hotel 0 0 2015 July 27 1 0 2 2 0 0 BB PRT Direct Direct 0 0 0 C C 0 No Deposit Booking not done by agent Booking not done by company 0 Transient 107.00 0 0 Check-Out 2015-07-03
Resort Hotel 0 9 2015 July 27 1 0 2 2 0 0 FB PRT Direct Direct 0 0 0 C C 0 No Deposit 303 Booking not done by company 0 Transient 103.00 0 1 Check-Out 2015-07-03
Resort Hotel 1 85 2015 July 27 1 0 3 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 Booking not done by company 0 Transient 82.00 0 1 Canceled 2015-05-06
Resort Hotel 1 75 2015 July 27 1 0 3 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 15 Booking not done by company 0 Transient 105.50 0 0 Canceled 2015-04-22
Resort Hotel 1 23 2015 July 27 1 0 4 2 0 0 BB PRT Online TA TA/TO 0 0 0 E E 0 No Deposit 240 Booking not done by company 0 Transient 123.00 0 0 Canceled 2015-06-23
Resort Hotel 0 35 2015 July 27 1 0 4 2 0 0 HB PRT Online TA TA/TO 0 0 0 D D 0 No Deposit 240 Booking not done by company 0 Transient 145.00 0 0 Check-Out 2015-07-05
Resort Hotel 0 68 2015 July 27 1 0 4 2 0 0 BB USA Online TA TA/TO 0 0 0 D E 0 No Deposit 240 Booking not done by company 0 Transient 97.00 0 3 Check-Out 2015-07-05
Resort Hotel 0 18 2015 July 27 1 0 4 2 1 0 HB ESP Online TA TA/TO 0 0 0 G G 1 No Deposit 241 Booking not done by company 0 Transient 154.77 0 1 Check-Out 2015-07-05
Resort Hotel 0 37 2015 July 27 1 0 4 2 0 0 BB PRT Online TA TA/TO 0 0 0 E E 0 No Deposit 241 Booking not done by company 0 Transient 94.71 0 0 Check-Out 2015-07-05
Resort Hotel 0 68 2015 July 27 1 0 4 2 0 0 BB IRL Online TA TA/TO 0 0 0 D E 0 No Deposit 240 Booking not done by company 0 Transient 97.00 0 3 Check-Out 2015-07-05
Resort Hotel 0 37 2015 July 27 1 0 4 2 0 0 BB PRT Offline TA/TO TA/TO 0 0 0 E E 0 No Deposit 8 Booking not done by company 0 Contract 97.50 0 0 Check-Out 2015-07-05
Resort Hotel 0 12 2015 July 27 1 0 1 2 0 0 BB IRL Online TA TA/TO 0 0 0 A E 0 No Deposit 240 Booking not done by company 0 Transient 88.20 0 0 Check-Out 2015-07-02
Resort Hotel 0 0 2015 July 27 1 0 1 2 0 0 BB FRA Corporate Corporate 0 0 0 A G 0 No Deposit Booking not done by agent 110 0 Transient 107.42 0 0 Check-Out 2015-07-02
Resort Hotel 0 7 2015 July 27 1 0 4 2 0 0 BB GBR Direct Direct 0 0 0 G G 0 No Deposit 250 Booking not done by company 0 Transient 153.00 0 1 Check-Out 2015-07-05
Resort Hotel 0 37 2015 July 27 1 1 4 1 0 0 BB GBR Online TA TA/TO 0 0 0 F F 0 No Deposit 241 Booking not done by company 0 Transient 97.29 0 1 Check-Out 2015-07-06
Resort Hotel 0 72 2015 July 27 1 2 4 2 0 0 BB PRT Direct Direct 0 0 0 A A 1 No Deposit 250 Booking not done by company 0 Transient 84.67 0 1 Check-Out 2015-07-07
Resort Hotel 0 72 2015 July 27 1 2 4 2 0 0 BB PRT Direct Direct 0 0 0 D D 1 No Deposit 250 Booking not done by company 0 Transient 99.67 0 1 Check-Out 2015-07-07
Resort Hotel 0 127 2015 July 27 1 2 5 2 0 0 HB GBR Offline TA/TO TA/TO 0 0 0 D I 0 No Deposit 115 Booking not done by company 0 Contract 94.95 0 1 Check-Out 2015-07-01
Resort Hotel 0 78 2015 July 27 1 2 5 2 0 0 BB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 5 Booking not done by company 0 Transient 63.60 1 0 Check-Out 2015-07-08
Resort Hotel 0 48 2015 July 27 1 2 5 2 0 0 BB IRL Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 8 Booking not done by company 0 Contract 79.50 0 0 Check-Out 2015-07-08
Resort Hotel 1 60 2015 July 27 1 2 5 2 0 0 BB PRT Online TA TA/TO 0 0 0 E E 0 No Deposit 240 Booking not done by company 0 Transient 107.00 0 2 Canceled 2015-05-11
Resort Hotel 0 77 2015 July 27 1 2 5 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 Booking not done by company 0 Transient 94.00 0 0 Check-Out 2015-07-08
Resort Hotel 0 99 2015 July 27 1 2 5 2 0 0 BB PRT Online TA TA/TO 0 0 0 D D 0 No Deposit 240 Booking not done by company 0 Transient 87.30 1 1 Check-Out 2015-07-08
Resort Hotel 0 118 2015 July 27 1 4 10 1 0 0 BB NULL Direct Direct 0 0 0 A A 2 No Deposit Booking not done by agent Booking not done by company 0 Transient 62.00 0 2 Check-Out 2015-07-15
Resort Hotel 0 95 2015 July 27 1 4 11 2 0 0 BB GBR Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 241 Booking not done by company 0 Transient 63.86 0 0 Check-Out 2015-07-16
Resort Hotel 1 96 2015 July 27 1 2 8 2 0 0 BB PRT Direct Direct 0 0 0 E E 0 No Deposit Booking not done by agent Booking not done by company 0 Transient 108.30 0 2 Canceled 2015-05-29
Resort Hotel 0 69 2015 July 27 2 2 4 2 0 0 BB IRL Offline TA/TO TA/TO 0 0 0 A C 0 No Deposit 175 Booking not done by company 0 Transient 65.50 0 0 Check-Out 2015-07-08

Descriptive statistics for variables

A detailed description of the extracted variables, their origin, and the engineering procedures employed in its creation is shown in the below table. The names of the tables can be referred from Diagram.1 in Data Description tab.

options(warn = -1)

#Preparing a table to show information regarding each variable in the data set
Variable_Name <- c("ADR","Adults","Agent","ArrivalDateDayOfMonth","ArrivalDateMonth","ArrivalDateWeekNumber","ArrivalDateYear","AssignedRoomType","Babies","BookingChanges","Children","Company","Country","CustomerType","DaysInWaitingList","DepositType","DistributionChannel","IsCanceled","IsRepeatedGuest","LeadTime","MarketSegment","Meal","PreviousBookingsNotCanceled","PreviousCancellations","RequiredCardParkingSpaces","ReservationStatus","ReservationStatusDate","ReservedRoomType","StaysInWeekendNights","StaysInWeekNights","TotalOfSpecialRequests")

Variable_Type <- c("Numeric","Integer","Categorical","Integer","Categorical","Integer","Integer","Categorical","Integer","Integer","Integer","Categorical","Categorical","Categorical","Integer","Categorical","Categorical","Categorical","Categorical","Integer","Categorical","Categorical","Integer","Integer","Integer","Categorical","Date","Categorical","Integer","Integer","Integer")

Variable_Description <- c("Average Daily Rate as defined",
"Number of adults",
"ID of the travel agency that made the bookinga",
"Day of the month of the arrival date",
"Month of arrival date with 12 categories: “January” to “December”",
"Week number of the arrival date",
"Year of arrival date",
"Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
"Number of babies",
"Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
"Number of children",
"ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
"Country of origin. Categories are represented in the ISO 3155–3:2013 format",
"Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
"Number of days the booking was in the waiting list before it was confirmed to the customer",
"Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
"Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Value indicating if the booking was canceled (1) or not (0)",
"Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
"Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)",
"Number of previous bookings not cancelled by the customer prior to the current booking",
"Number of previous bookings that were cancelled by the customer prior to the current booking",
"Number of car parking spaces required by the customer",
"Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
"Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel",
"Code of room type reserved. Code is presented instead of designation for anonymity reasons",
"Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
"Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
"Number of special requests made by the customer (e.g. twin bed or high floor)")

Variable_Source <- c("BO, BL and TR / Calculated by dividing the sum of all lodging transactions by the total number of staying nights",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL/Calculated by adding the number of unique iterations that change some of the booking attributes, namely: persons, arrival date, nights, reserved room type or meal",
"BO and BL/Sum of both payable and non-payable children",
"BO and BL.",
"BO, BL and NT",
"BO and BL",
"BO/Calculated by subtracting the date the booking was confirmed to the customer from the date the booking entered on the PMS",
"BO and TR/Value calculated based on the payments identified for the booking in the transaction (TR) table before the booking׳s arrival or cancellation date. In case no payments were found the value is “No Deposit”. If the payment was equal or exceeded the total cost of stay, the value is set as “Non Refund”. Otherwise the value is set as “Refundable”",
"BO, BL and DC",
"BO",
"BO, BL and C/ Variable created by verifying if a profile was associated with the booking customer. If so, and if the customer profile creation date was prior to the creation date for the booking on the PMS database it was assumed the booking was from a repeated guest",
"BO and BL/ Subtraction of the entering date from the arrival date",
"BO, BL and MS",
"BO, BL and ML",
"BO and BL / In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and not canceled.",
"BO and BL/ In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and canceled.",
"BO and BL",
"BO",
"BO",
"BO and BL",
"BO and BL/ Calculated by counting the number of weekend nights from the total number of nights",
"BO and BL/Calculated by counting the number of week nights from the total number of nights",
"BO and BL/Sum of all special requests")

Data_Description <- as_data_frame(cbind(Variable_Name, Variable_Type, Variable_Description, Variable_Source))

DT::datatable(head(Data_Description,35))

We will now do a summary analysis of each variable in the data set using describe().

#obtaining summary statistics
describe(hotel_data)
##                                vars     n    mean     sd median trimmed   mad
## hotel*                            1 87229    1.39   0.49    1.0    1.36  0.00
## is_canceled                       2 87229    0.28   0.45    0.0    0.22  0.00
## lead_time                         3 87229   79.97  86.06   49.0   65.96 65.23
## arrival_date_year                 4 87229 2016.21   0.69 2016.0 2016.26  1.48
## arrival_date_month*               5 87229    6.29   3.47    6.0    6.26  4.45
## arrival_date_week_number          6 87229   26.84  13.67   27.0   26.76 16.31
## arrival_date_day_of_month         7 87229   15.82   8.84   16.0   15.82 11.86
## stays_in_weekend_nights           8 87229    1.00   1.03    1.0    0.92  1.48
## stays_in_week_nights              9 87229    2.62   2.04    2.0    2.41  1.48
## adults                           10 87229    1.88   0.62    2.0    1.89  0.00
## children                         11 87229    0.14   0.46    0.0    0.00  0.00
## babies                           12 87229    0.01   0.11    0.0    0.00  0.00
## meal*                            13 87229    1.55   1.07    1.0    1.32  0.00
## country*                         14 87229   88.16  45.11   77.0   88.92 66.72
## market_segment*                  15 87229    6.08   1.34    7.0    6.31  0.00
## distribution_channel*            16 87229    3.53   0.95    4.0    3.73  0.00
## is_repeated_guest                17 87229    0.04   0.19    0.0    0.00  0.00
## previous_cancellations           18 87229    0.03   0.37    0.0    0.00  0.00
## previous_bookings_not_canceled   19 87229    0.18   1.73    0.0    0.00  0.00
## reserved_room_type*              20 87229    2.26   1.83    1.0    1.95  0.00
## assigned_room_type*              21 87229    2.66   1.97    1.0    2.39  0.00
## booking_changes                  22 87229    0.27   0.71    0.0    0.10  0.00
## deposit_type*                    23 87229    1.01   0.13    1.0    1.00  0.00
## agent*                           24 87229  226.80 116.81  315.0  238.23 28.17
## company*                         25 87229  337.58  51.21  349.0  349.00  0.00
## days_in_waiting_list             26 87229    0.75  10.00    0.0    0.00  0.00
## customer_type*                   27 87229    3.06   0.53    3.0    3.04  0.00
## adr                              28 87229  106.52  54.89   98.2  102.49 44.85
## required_car_parking_spaces      29 87229    0.08   0.28    0.0    0.00  0.00
## total_of_special_requests        30 87229    0.70   0.83    0.0    0.58  0.00
## reservation_status*              31 87229    1.75   0.46    2.0    1.80  0.00
## reservation_status_date          32 87229     NaN     NA     NA     NaN    NA
##                                 min  max range  skew kurtosis   se
## hotel*                            1    2     1  0.45    -1.79 0.00
## is_canceled                       0    1     1  1.01    -0.99 0.00
## lead_time                         0  737   737  1.43     2.12 0.29
## arrival_date_year              2015 2017     2 -0.30    -0.88 0.00
## arrival_date_month*               1   12    11  0.02    -1.20 0.01
## arrival_date_week_number          1   53    52  0.02    -0.95 0.05
## arrival_date_day_of_month         1   31    30  0.00    -1.20 0.03
## stays_in_weekend_nights           0   19    19  1.34     7.07 0.00
## stays_in_week_nights              0   50    50  2.58    20.10 0.01
## adults                            0   55    55 20.39  1391.01 0.00
## children                          0   10    10  3.46    12.92 0.00
## babies                            0   10    10 21.13  1202.71 0.00
## meal*                             1    4     3  1.53     0.57 0.00
## country*                          1  178   177  0.12    -1.46 0.15
## market_segment*                   1    8     7 -1.27     0.44 0.00
## distribution_channel*             1    5     4 -1.63     1.02 0.00
## is_repeated_guest                 0    1     1  4.79    20.98 0.00
## previous_cancellations            0   26    26 34.33  1725.48 0.00
## previous_bookings_not_canceled    0   72    72 20.46   578.92 0.01
## reserved_room_type*               1    9     8  1.09    -0.07 0.01
## assigned_room_type*               1   11    10  0.81    -0.23 0.01
## booking_changes                   0   18    18  5.07    53.99 0.00
## deposit_type*                     1    3     2  9.82   106.95 0.00
## agent*                            1  334   333 -0.50    -1.49 0.40
## company*                          1  349   348 -4.75    22.01 0.17
## days_in_waiting_list              0  391   391 19.47   483.77 0.03
## customer_type*                    1    4     3 -1.39     6.59 0.00
## adr                               0 5400  5400 11.02   992.18 0.19
## required_car_parking_spaces       0    8     8  3.49    21.71 0.00
## total_of_special_requests         0    5     5  1.08     0.82 0.00
## reservation_status*               1    3     2 -0.78    -0.46 0.00
## reservation_status_date         Inf -Inf  -Inf    NA       NA   NA

Exploratory Data Analysis

Data Analysis and Visualization

Following are the problem statements which we are trying to solve with our analysis:

  1. To visualize which year, which month and among weekday / weekend nights have the maximum booking , to understand what is the best time to generate revenue flow in the business.
  2. To understand which type of customer, for e.g., new customers or existing customers add more revenue to the business.
  3. To understand what are the factors leading to cancellations of booking affecting the revenue generation of the hotels.

All the three problem statement is to help improve the revenue generation of the hotel business.

Let’s start with the analysis!

Uncovering Information:

Data analysis to find a solution to our problems can be done using plots and graphs which has been implemented below by slicing the data, creating new dataframes, splitting column values

We first would prepare data to find the busiest time for hotel booking.

We split the column reservation_status_date into three new columns month,year, and day. This will help us to do analysis on our first problem statement.

# Sub-setting the data with only reservation_status_date and reservation_status #

reservation_date <- hotel_data[,c(31,32)]
head(reservation_date)
## # A tibble: 6 x 2
##   reservation_status reservation_status_date
##   <chr>              <date>                 
## 1 Check-Out          2015-07-01             
## 2 Check-Out          2015-07-01             
## 3 Check-Out          2015-07-02             
## 4 Check-Out          2015-07-02             
## 5 Check-Out          2015-07-03             
## 6 Check-Out          2015-07-03
# Splitting the reservation_status_date column into 3 columns based on Year , Month & Day #

reservation_date <- separate(reservation_date,reservation_status_date,c("year", "month","day"), sep = "-")
head(reservation_date)
## # A tibble: 6 x 4
##   reservation_status year  month day  
##   <chr>              <chr> <chr> <chr>
## 1 Check-Out          2015  07    01   
## 2 Check-Out          2015  07    01   
## 3 Check-Out          2015  07    02   
## 4 Check-Out          2015  07    02   
## 5 Check-Out          2015  07    03   
## 6 Check-Out          2015  07    03
# Analyzing Count of Booking per Year #

yearwise_data <- sqldf("select count(reservation_status) as [No_of_bookings], year  from reservation_date group by year")

#plotting the bar graph for bookings based on year
booking_year <- ggplot(yearwise_data, aes(x = year, y = No_of_bookings)) +
                geom_bar(stat = "identity", fill = "lightblue") + ggtitle("No of Bookings vs Year")

booking_year

# Analyzing Count of Booking per Month #

monthwise_data <- sqldf("select count(reservation_status) as [No_of_bookings],month  from reservation_date group by month")

#plotting the bar graph for bookings based on month
booking_month <- ggplot(monthwise_data, aes(x = month, y = No_of_bookings)) +
  geom_bar(stat = "identity", fill = "lightpink") + ggtitle("# of bookings across months")

booking_month

From the above two graphs we find out that August has been the busiest month of the year with the highest number of booking and the year 2016 has seen the maximum number of bookings.

Now we will analyse the booking based on hotel type monthly and yearly.

#plotting graph for yearly bookings based on hotel type
ggplot(data = hotel_data,aes(arrival_date_year,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Years") +
  ggtitle("Hotel Type vs # of bookings across years") +
  labs(fill = 'Hotel Type')

We can see that for both the hotel type highest number of bookings has been in the year 2016.

#Rearranging the factors to arrange months
hotel_data <- hotel_data %>% 
  mutate(Months = fct_relevel(arrival_date_month,"January",
                              "February",
                              "March",
                              "April",
                              "May",
                              "June",
                              "July",
                              "August",
                              "September",
                              "October",
                              "November",
                              "December"))


#plotting the graph for monthly bookings based on hotel type
ggplot(data = hotel_data,aes(Months,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  facet_grid(hotel_data$arrival_date_year) +
  scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
  xlab("Months") +
  ggtitle("Hotel Type vs # of bookings across months") +
  labs(fill = 'Hotel Type')

For City Hotel in the year 2015 bookings have been maximum in September, 2016 in August and 2017 in May.

For Resort Hotel in the year bookings has been maximum in August for all the three years.

Now we proceed with finding the whether the bookings have been more for weekends or weekdays.

# Sub-setting the data with only reservation_status_date, stays_in_weekend_nights and  stays_in_week_nights #

reservation_day <- hotel_data[,c(8,9,32)]
head(reservation_day)
## # A tibble: 6 x 3
##   stays_in_weekend_nights stays_in_week_nights reservation_status_date
##                     <dbl>                <dbl> <date>                 
## 1                       0                    0 2015-07-01             
## 2                       0                    0 2015-07-01             
## 3                       0                    1 2015-07-02             
## 4                       0                    1 2015-07-02             
## 5                       0                    2 2015-07-03             
## 6                       0                    2 2015-07-03
# Splitting the reservation_status_date column into 3 columns based on Year , Month & Day #

reservation_day <- separate(reservation_day,reservation_status_date,c("year", "month","day"), sep = "-")
head(reservation_day)
## # A tibble: 6 x 5
##   stays_in_weekend_nights stays_in_week_nights year  month day  
##                     <dbl>                <dbl> <chr> <chr> <chr>
## 1                       0                    0 2015  07    01   
## 2                       0                    0 2015  07    01   
## 3                       0                    1 2015  07    02   
## 4                       0                    1 2015  07    02   
## 5                       0                    2 2015  07    03   
## 6                       0                    2 2015  07    03
#preparing table to analyze the difference between bookings for weekend nights and weekday nights #

weekwise_date  <- sqldf("select sum(stays_in_weekend_nights) as No_of_weekend_nights , sum(stays_in_week_nights) as No_of_weekday_nights,month
        from reservation_day group by month")

#plotting line chart for booking based on weekends and weekdays
colors <- c("No_of_weekend_nights" = "darkred", "No_of_weekday_nights" = "orange")


ggplot(weekwise_date, aes(x = month, group = 1)) +
  geom_line(aes(y = No_of_weekend_nights, color = "No_of_weekend_nights")) +
  geom_line(aes(y = No_of_weekday_nights, color = "No_of_weekday_nights")) + labs(y = "No of nights stayed", x = "Month", color = "WeekDay/Weekend Night") + scale_color_manual(values = colors) + ggtitle("# weekday/weekend nights stayed across Months")

#calculating the weekly summary
week_summary <- hotel_data %>%
  filter(is_canceled == 0) %>% 
  group_by(hotel,arrival_date_year) %>%
  summarize(avg.weekend.stay = round(sum(stays_in_weekend_nights)/2),
            avg.weekday.stay = round(sum(stays_in_week_nights)/5))
## `summarise()` has grouped output by 'hotel'. You can override using the `.groups` argument.
paged_table(week_summary)

We know that majority of holidays call for hotels and we expect a high volume of booking on Weekends as compared to Weekdays.

But from the above graph and summary table we see that number of bookings during the weekday is higher than the weekends, one of the reasons can be the corporate meetings and get together happen on Weekdays.

Now lets start with our second problem statement, to find the customer type helping in generating more revenue for the hotel.

# Total Number of cancellations for existing customers and new customers

hotel_data %>% 
  group_by(is_repeated_guest,is_canceled) %>% 
  summarise(length(is_canceled))
## `summarise()` has grouped output by 'is_repeated_guest'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   is_repeated_guest [2]
##   is_repeated_guest is_canceled `length(is_canceled)`
##               <dbl>       <dbl>                 <int>
## 1                 0           0                 60117
## 2                 0           1                 23749
## 3                 1           0                  3103
## 4                 1           1                   260

From above, it is clear that number of bookings cancelled for new customers is 23745 while that of existing customers is 260. Thus, it is evident that the existing customers tend to trust the hotels and thus add more revenue to the hotel.

Now lets analyze our final problem statement - factors leading to cancellation.

ggplot(data = hotel_data) +
    geom_bar(mapping = aes(x = is_canceled, fill = hotel)) +
    scale_fill_manual(values = c("#ffd700", "steelblue")) +
    labs(title = "Cancellation rate by hotel") +
    xlab("Is Canceled") +
    ylab("Count")

From the above graph we can conclude that the cancellation rate of city hotel is higher than the resort hotel.

# finding the effect of Lead Time on Booking Cancellation
hotel_data %>% 
  group_by(is_canceled) %>% 
  summarise(mean = mean(lead_time))
## # A tibble: 2 x 2
##   is_canceled  mean
##         <dbl> <dbl>
## 1           0  70.2
## 2           1 106.

From above result, we can see that when average lead time is 106, customers tend to cancel their bookings. However, bookings are not cancelled if average lead time is 70.

#finding the impact of Market Segment on Cancelled Bookings
hotel_data %>%
    ggplot(aes(is_canceled, fill = hotel_data$market_segment)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior of Booking Status", 
         subtitle = hotel_data$market_segment,
         fill="Market Segment",
         x = "Booking Status", 
         y = "Percentage of Cancellation") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "blue"))

#finding the impact of Distribution Channel on Cancelled Bookings
hotel_data %>%
    ggplot(aes(is_canceled, fill = hotel_data$distribution_channel)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior of Booking Status", 
         subtitle = hotel_data$distribution_channel,
         fill="Distribution Channel",
         x = "Booking Status", 
         y = "Percentage of Cancellation") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "red"))

From the above two graphs we can see that in market segment - Online TA and in distribution channel - TA/TO contribute more towards booking cancellation.

#finding the impact of previous Cancellation on Booking Cancellation

hotel_data %>% 
  group_by(is_canceled) %>% 
  summarise(mean = mean(previous_cancellations))
## # A tibble: 2 x 2
##   is_canceled   mean
##         <dbl>  <dbl>
## 1           0 0.0187
## 2           1 0.0613

From the above summary we can see people who have done cancellation previously are tend to cancel their current bookings as well.

#finding the impact of Wait Period on Booking Cancellation

hotel_data %>% 
  group_by(is_canceled) %>% 
  summarise(mean(days_in_waiting_list))
## # A tibble: 2 x 2
##   is_canceled `mean(days_in_waiting_list)`
##         <dbl>                        <dbl>
## 1           0                        0.717
## 2           1                        0.823

From above, we can see more cancellations are done when average wait period is more than 82 days.

To find the revenue generation of each hotel, we have created a variable to store revenue earned from each booking.

#adding a column showing revenue per booking

hotel_data <- hotel_data %>% 
  mutate(Revenue_Per_Booking = adr * (stays_in_weekend_nights + stays_in_week_nights))

dim(hotel_data)
## [1] 87229    34

Now we have 34 variables in our data set.

Let’s see the summary of the new variable added.

#Summary Statistics of Revenue_Per_Booking

summary(hotel_data$Revenue_Per_Booking)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   153.0   299.2   395.0   504.0  7590.0

The above summary indicates that the average revenue generated per booking is $395.

Linear Modelling

We are incorporating Linear Modeling to understand the impact of the other variables on Revenue.

To build the model , we would first prepare the data week wise. We would like to include Week number of arrival, hotel type, customer type, meal type, is_repeated_guest, total number of bookings, total revenue for the week, average adr and average stay nights.

#finding the total number of nights the guest stayed including weekday nights and weekend nights
hotel_data$total_stay_nights <- rowSums(cbind(hotel_data$stays_in_weekend_nights, hotel_data$stays_in_week_nights), na.rm = TRUE)

#finding the total revenue earned by each booking by multiplying total number of nights and average daily price

hotel_data <- hotel_data %>% mutate(revenue = (total_stay_nights * adr))

#preparing the model
data_model <- hotel_data %>% filter(is_canceled == 0) %>%  group_by(arrival_date_week_number, hotel, customer_type, meal, is_repeated_guest) %>% summarise(total_bookings = n(), total_guests = sum(adults + children + babies), average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
## `summarise()` has grouped output by 'arrival_date_week_number', 'hotel', 'customer_type', 'meal'. You can override using the `.groups` argument.
model_1 <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_model)
summary(model_1)
## 
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) + 
##     as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + 
##     week_adr, data = data_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52144 -19784  -3797   8512 238367 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               6809.60    3559.52   1.913   0.0560
## as.factor(hotel)Resort Hotel              -750.30    2092.62  -0.359   0.7200
## as.factor(customer_type)Group            -3445.48    3543.84  -0.972   0.3311
## as.factor(customer_type)Transient        39166.66    2980.50  13.141   <2e-16
## as.factor(customer_type)Transient-Party   7900.53    3066.95   2.576   0.0101
## as.factor(meal)FB                       -54489.48    4404.66 -12.371   <2e-16
## as.factor(meal)HB                       -30587.91    2409.70 -12.694   <2e-16
## as.factor(meal)SC                       -33358.52    2472.32 -13.493   <2e-16
## as.factor(is_repeated_guest)1           -30760.10    2220.00 -13.856   <2e-16
## average_stay_nights                        451.32     351.29   1.285   0.1991
## week_adr                                   196.51      20.76   9.467   <2e-16
##                                            
## (Intercept)                             .  
## as.factor(hotel)Resort Hotel               
## as.factor(customer_type)Group              
## as.factor(customer_type)Transient       ***
## as.factor(customer_type)Transient-Party *  
## as.factor(meal)FB                       ***
## as.factor(meal)HB                       ***
## as.factor(meal)SC                       ***
## as.factor(is_repeated_guest)1           ***
## average_stay_nights                        
## week_adr                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33530 on 1334 degrees of freedom
## Multiple R-squared:  0.3675, Adjusted R-squared:  0.3628 
## F-statistic: 77.52 on 10 and 1334 DF,  p-value: < 2.2e-16

Results

  • The relationship between Revenue and Transient,Transient-Party type of customers, type of meals, week average daily rate are statistically significant.
  • The relationship between Revenue and hotel type, group type of customers, and guest repetition are not statistically significant.
  • Looking at the residual standard error we can deduce that 68% (1 standard deviation) of the predicted Revenue values will be within -+33530 of the real values. The value is large and we consider using other models in the future.
  • The R-squared value is 0.3675 and adjusted R-squared is 0.3628. These values suggest that it is comparatively low. It could be increased by adding more independent variables but then we also wanted to avoid overfitting of the model.
  • The overall F-test is statistically significant because the p value is smaller than .05.
  • So we conclude that the correlation between the model and the dependent variable is statistically significant overall.

Summary

Based on our exploratory data analysis, below are some of the insights we gained:

  • August has been the busiest month of the year with the highest number of booking and the year 2016 has seen the maximum number of bookings.

  • The number of bookings during the weekday is higher than the weekends, one of the reasons can be the corporate meetings and get together happen on Weekdays.

  • The existing customers tend to trust the hotels and thus add more revenue to the hotel.

  • The cancellation rate of city hotel is higher than the resort hotel.

  • When average lead time is more, customers tend to cancel their bookings.

  • In market segment - Online TA and in distribution channel - TA/TO contribute more towards booking cancellation.

  • People who have done more number of booking cancellation previously are tend to cancel their current bookings as well.

  • More cancellations are done when average wait period is more than 82 days.

  • The average revenue generated per booking is $395.

Based on our findings we can suggest that to increase the revenue hotel industry can pay more attention to factors such as average wait time, lead time, customer type, booking days, etc.