Authors: Pallabi Das | Trishita Aditya | Parul Ranjan
The hospitality industry is competitive, and businesses need to keep up with the latest hospitality trends to avoid being left behind. However, the needs and priorities of customers have also changed due to COVID. Keeping pace with the industry is a great way to ensure your business delivers the kind of customer experience people want and expect. This includes everything from marketing efforts and hospitality processes to embracing the latest technology and responding to global events. And one of the best ways to do so is by analyzing the real-time hotel booking data set to gain insights about the change in customer booking trends and based on those businesses can come up with the best strategies to deliver the best customer experience.
With our analysis we try to provide insights into effective revenue management by predicting with is the busiest time for the hotel industry so that they can best prepare their services and secondly we will also try to find out which type of customers add more to the revenue.
The data set which we have chosen to do analysis contains hotel demand data of two hotels in Portugal: Resort Hotel of Algarve and City Hotel of Lisbon. This data set has 31 variables describing 40,060 observations of Resort hotel and 79,330 observations of City Hotel.
Following are the problem statements that we are going to answer with our analysis:
We start with data cleaning, like removing duplicate values, identifying null values, and then imputing them or removing them as per requirement, and removing trivial data if found. We will also do quality assurance to convert our raw data into appropriate formats. We do all this cleaning so that any insignificant data does not produce any irrelevant result leading to wrong analysis. Our second step is going to be exploratory data analysis – summary statistics, finding outliers, correlations between variables, transforming our data analysis into data visualizations which will help us to summarize the data so we can easily spot trends, patterns, and relationships in the data set, etc. Lastly, we will create our model to predict the busiest time for the booking, the customer groups generating more revenue and factors leading to cancellations.
Our agenda is to determine the busiest time for the bookings in both the hotels, find the customer type leading to more revenue generation and the booking cancellation reasons. This helps in determining the factors which can be emphasized by the hotels to provide the best customer experience. We need to find the dependency of variables on each other and variables affecting our problem statements, in the data set. Once we have determined the variables, we can apply any algorithm like a decision tree to predict the busiest time or customer type by training our model with the current data. Additionally, we will also try to find out the revenue generation of each hotel and the factors which drive it.
With our analysis we will be able to find out the busiest time for hotel bookings, this will help the hotel industry to provide the best offers on bookings, also help in keeping the staff ready and well equipped to handle the high load so that the hotel operation runs smoothly at this time, providing the best customer experience. This will lead to customers’ good feedback and increase the brand value. Also finding which type of customer does a greater number of bookings, will help provide them with the best offers and services to enhance their experience at the hotel making them loyal to the hotel. All these factors will improve revenue generation with the right marketing strategies. By knowing the reasons for cancellation, hotels can emphasize on the factors and improve their business so that it leads to decrease in booking cancellations and increased revenue generation.
We will be using the following packages in our code:
options(warn = -1)
library(tidyverse) #provide key data transformation functions in a single package.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr) #For report generation
library(kableExtra) #For building tables
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(dplyr) #For data manipulation
library(corrplot) #For checking the correlation between different attributes
## corrplot 0.92 loaded
library(ggplot2) #For data visualization
library(psych) #Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(sqldf) #sqldf() transparently sets up a database, imports the data frames into that database, performs the SQL select or other statement and returns the result using a heuristic to determine which class to assign to each column of the returned data frame.
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(corrplot) #provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
library(DT) #provides an R interface to the JavaScript library DataTables. R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables.
library(rmarkdown) ##used for formatting the markdown file
The data comes from an open hotel booking demand data set from Antonio, Almeida and Nunes, 2019 available at Github.
This data set describes hotel demand data. One of the hotels is a resort hotel and the other is a city hotel. The data set contains 31 variables describing the 40,060 observations of resort hotel and 79,330 observations of city hotel. Each observation represents a hotel booking. the data set comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.
Original Purpose: The data set was made aiming at the development of prediction models to classify a hotel booking׳s likelihood to be canceled. Nevertheless, due to the characteristics of the variables included in the data set, their use goes beyond this cancellation prediction problem.
Data set History: In order to prevent leakage of future information, the timestamp of the target variable must occur after the input variables’ timestamp. Thus, instead of directly extracting variables from the bookings database table, when available, the variables’ values were extracted from the bookings change log, with a timestamp relative to the day prior to arrival date (for all the bookings created before their arrival date).
Not all variables in this dataset come from the bookings or change log database tables. Some come from other tables, and some are engineered from different variables from different tables. A diagram presenting the PMS database tables from where variables were extracted is presented below.
Data was obtained directly from the hotels’ PMS databases’ servers by executing a TSQL query on SQL Server Studio Manager. This query first collected the value or ID (in the case of foreign keys) of each variable in the BO table. The BL table was then checked for any alteration with respect to the day prior to the arrival. If an alteration was found, the value used was the one present in the BL table. For all the variables holding values in related tables (like meals, distribution channels, nationalities or market segments), their related values were retrieved.
The PMS assured no missing data exists in its database tables. However, in some categorical variables like Agent or Company, “NULL” is presented as one of the categories. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.
We have read the csv file and found out that there are 119390 rows and 32 columns in the data set.
options(warn = -1)
hotel_data <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#retrieving the full column specification for this data.
spec(hotel_data)
## cols(
## hotel = col_character(),
## is_canceled = col_double(),
## lead_time = col_double(),
## arrival_date_year = col_double(),
## arrival_date_month = col_character(),
## arrival_date_week_number = col_double(),
## arrival_date_day_of_month = col_double(),
## stays_in_weekend_nights = col_double(),
## stays_in_week_nights = col_double(),
## adults = col_double(),
## children = col_double(),
## babies = col_double(),
## meal = col_character(),
## country = col_character(),
## market_segment = col_character(),
## distribution_channel = col_character(),
## is_repeated_guest = col_double(),
## previous_cancellations = col_double(),
## previous_bookings_not_canceled = col_double(),
## reserved_room_type = col_character(),
## assigned_room_type = col_character(),
## booking_changes = col_double(),
## deposit_type = col_character(),
## agent = col_character(),
## company = col_character(),
## days_in_waiting_list = col_double(),
## customer_type = col_character(),
## adr = col_double(),
## required_car_parking_spaces = col_double(),
## total_of_special_requests = col_double(),
## reservation_status = col_character(),
## reservation_status_date = col_date(format = "")
## )
Checking for duplicate records: Sometimes, we have duplicate records present in the data set which causes sampling bias and hence, we remove any duplicate records present in the data set.
#removing any duplicate records in the dataset
hotel_data <- unique(hotel_data)
The total unique observations has become 87396 which shows that 26.8% of the records were duplicate which were dropped off.
Checking for missing records records:
We check for missing records as it might generate wrong analysis results. Based on the situation we either remove it or impute it.
#Checking for total no of missing values per column
colSums(is.na(hotel_data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
We found out that there are 4 missing records for children column. We will impute the missing records with the median value of the column.
#Imputing the median for missing values for children column
hotel_data$children[is.na(hotel_data$children)] <- median(hotel_data$children,na.rm = T)
Checking for string inconsistencies:
We now check our data set for any structural errors, so that the data is consistant throughout the data set.
#checking for String inconsistencies
unique(hotel_data$hotel)
## [1] "Resort Hotel" "City Hotel"
unique(hotel_data$arrival_date_month)
## [1] "July" "August" "September" "October" "November" "December"
## [7] "January" "February" "March" "April" "May" "June"
unique(hotel_data$meal)
## [1] "BB" "FB" "HB" "SC" "Undefined"
unique(hotel_data$market_segment)
## [1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
## [5] "Complementary" "Groups" "Undefined" "Aviation"
unique(hotel_data$distribution_channel)
## [1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
unique(hotel_data$reserved_room_type)
## [1] "C" "A" "D" "E" "G" "F" "H" "L" "P" "B"
unique(hotel_data$assigned_room_type)
## [1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "P" "L" "K"
unique(hotel_data$deposit_type)
## [1] "No Deposit" "Refundable" "Non Refund"
unique(hotel_data$customer_type)
## [1] "Transient" "Contract" "Transient-Party" "Group"
unique(hotel_data$reservation_status)
## [1] "Check-Out" "Canceled" "No-Show"
We find that in meal column both undefined and SC means no meal package, so we replace undefined with SC
#replacing undefined with SC
hotel_data$meal[hotel_data$meal == "Undefined"] <- "SC"
unique(hotel_data$meal)
## [1] "BB" "FB" "HB" "SC"
Checking for negative values:
We are checking for negative values in our data set to ensure the correctness of the data.
#checking for negative values
nrow(subset(hotel_data, hotel_data$is_canceled < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$lead_time < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_year < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_week_number < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$arrival_date_day_of_month < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$stays_in_weekend_nights < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$stays_in_week_nights < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$adults < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$children < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$babies < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$is_repeated_guest < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$previous_cancellations < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$previous_bookings_not_canceled < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$booking_changes < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$days_in_waiting_list < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$adr < 0))
## [1] 1
nrow(subset(hotel_data, hotel_data$required_car_parking_spaces < 0))
## [1] 0
nrow(subset(hotel_data, hotel_data$total_of_special_requests < 0))
## [1] 0
We can see that there is only 1 row having negative adr which makes no sense because if a room is booked, it will have some positive or zero (in case of No Deposit) average daily rate. We can drop off these records for our further analysis.
#We find that adr has one negative value, so we remove that row
hotel_data <- hotel_data %>% filter(hotel_data$adr > 0 | hotel_data$adr == 0)
Checking for irrelevant bookings We will check for booking with no adults, children and babies.
# Finding number of records when no adult, children or babies checked in
nrow(subset(hotel_data, hotel_data$adults == 0 & hotel_data$children == 0 & hotel_data$babies == 0))
## [1] 166
There 166 records showing with no person being checked in. So we will drop off these records.
#Removing 166 records where the booking is under 0 adult and 0 children and 0 babies
hotel_data <- hotel_data %>% filter(hotel_data$adults != 0 | hotel_data$children != 0 | hotel_data$babies != 0)
We will also check for booking under just babies because this seems impractical. Children of age 17 can still do bookings without being accompanied by adults.
# Finding number of records with only babies checked in
nrow(subset(hotel_data, hotel_data$adults == 0 & hotel_data$children == 0 & hotel_data$babies > 0))
## [1] 0
There are 0 records of just babies being checked in.
Variable datatypes cleaning
We have observed from the structure that all the data types for the variables are appropriate and no further changes are needed.
Checking for Null values in the data set
#checking for null value for company
nrow(subset(hotel_data, hotel_data$company == 'NULL'))
## [1] 81992
#checking for null value for agent
nrow(subset(hotel_data, hotel_data$agent == 'NULL'))
## [1] 12141
There are 81992 and 12141 null values for Company and Agent column respectively. This should not be considered a missing value, but rather as “not applicable”. For example, if a booking “Agent” is defined as “NULL” it means that the booking did not came from a travel agent.
Therefore, we will replace the NULL values with Booking not done by company and Booking not done by Agent for Company and Agent column respectively.
#replacing null values in company
hotel_data$company[hotel_data$company == "NULL"] <- "Booking not done by company"
#replacing null values in agent
hotel_data$agent[hotel_data$agent == "NULL"] <- "Booking not done by agent"
Checking for Outliers:
Finally, we will find out the outliers present in the data. These outliers may be due to incorrect data or due to the extreme values. We are going to find out the outliers of each numeric column to determine whether to keep or remove those outliers.
We have used boxplots to determine the outliers of each numeric column.
# Creating boxplot of all numeric variables
boxplot(hotel_data$lead_time, ylab = "lead_time",main = "Boxplot of lead_time")
boxplot(hotel_data$arrival_date_week_number, ylab = "arrival_date_week_number",main = "Boxplot of arrival_date_week_number")
boxplot(hotel_data$arrival_date_year, ylab = "arrival_date_year",main = "Boxplot of arrival_date_year")
boxplot(hotel_data$arrival_date_day_of_month, ylab = "arrival_date_day_of_month",main = "Boxplot of arrival_date_day_of_month")
boxplot(hotel_data$stays_in_weekend_nights, ylab = "stays_in_weekend_nights",main = "Boxplot of stays_in_weekend_nights")
boxplot(hotel_data$stays_in_week_nights, ylab = "stays_in_week_nights",main = "Boxplot of stays_in_week_nights")
boxplot(hotel_data$adults, ylab = "adults",main = "Boxplot of adults")
boxplot(hotel_data$children, ylab = "children",main = "Boxplot of children")
boxplot(hotel_data$babies, ylab = "babies",main = "Boxplot of babies")
boxplot(hotel_data$previous_cancellations, ylab = "previous_cancellations",main = "Boxplot of previous_cancellations")
boxplot(hotel_data$previous_bookings_not_canceled, ylab = "previous_bookings_not_canceled",main = "Boxplot of previous_bookings_not_canceled")
boxplot(hotel_data$booking_changes, ylab = "booking_changes",main = "Boxplot of booking_changes")
boxplot(hotel_data$days_in_waiting_list, ylab = "days_in_waiting_list",main = "Boxplot of days_in_waiting_list")
boxplot(hotel_data$adr, ylab = "adr",main = "Boxplot of Average_Daily_Rate")
boxplot(hotel_data$required_car_parking_spaces, ylab = "required_car_parking_spaces",main = "Boxplot of required_car_parking_spaces")
boxplot(hotel_data$total_of_special_requests, ylab = "total_of_special_requests",main = "Boxplot of total_of_special_requests")
From the graphs above, we can visualize that there are outliers present for all numeric columns except arrival_date_week_number
and arrival_date_day_of_month
column. These outliers seems to be something originating from extreme values and not due to error. Thus, we will keep them in our dataset.
Looking at the structure of our data set after cleaning
After doing all the above steps for data cleaning we will now look at the structure of our cleaned data set.
#examining the structure of the data set
str(hotel_data)
## tibble [87,229 x 32] (S3: tbl_df/tbl/data.frame)
## $ hotel : chr [1:87229] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:87229] 0 0 0 0 0 0 0 1 1 1 ...
## $ lead_time : num [1:87229] 342 737 7 13 14 0 9 85 75 23 ...
## $ arrival_date_year : num [1:87229] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:87229] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:87229] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:87229] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:87229] 0 0 1 1 2 2 2 3 3 4 ...
## $ adults : num [1:87229] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:87229] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:87229] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:87229] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:87229] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:87229] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:87229] "C" "C" "C" "A" ...
## $ booking_changes : num [1:87229] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:87229] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:87229] "Booking not done by agent" "Booking not done by agent" "Booking not done by agent" "304" ...
## $ company : chr [1:87229] "Booking not done by company" "Booking not done by company" "Booking not done by company" "Booking not done by company" ...
## $ days_in_waiting_list : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:87229] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:87229] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:87229] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:87229] 0 0 0 0 1 0 1 1 0 0 ...
## $ reservation_status : chr [1:87229] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:87229], format: "2015-07-01" "2015-07-01" ...
We now have a clean data set with 87,229 observations and 32 variables with which we can proceed for our data analysis. Let’s have a look at our data set in the condensed form:
kable(head(hotel_data, 32), booktabs = TRUE) %>%
kable_styling(font_size = 12)
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 0.00 | 0 | 0 | Check-Out | 2015-07-01 |
Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | Booking not done by company | 0 | Transient | 75.00 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 98.00 | 0 | 1 | Check-Out | 2015-07-03 |
Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 107.00 | 0 | 0 | Check-Out | 2015-07-03 |
Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | Booking not done by company | 0 | Transient | 103.00 | 0 | 1 | Check-Out | 2015-07-03 |
Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 82.00 | 0 | 1 | Canceled | 2015-05-06 |
Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | Booking not done by company | 0 | Transient | 105.50 | 0 | 0 | Canceled | 2015-04-22 |
Resort Hotel | 1 | 23 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 123.00 | 0 | 0 | Canceled | 2015-06-23 |
Resort Hotel | 0 | 35 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | HB | PRT | Online TA | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 145.00 | 0 | 0 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 68 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | USA | Online TA | TA/TO | 0 | 0 | 0 | D | E | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 97.00 | 0 | 3 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 18 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 1 | 0 | HB | ESP | Online TA | TA/TO | 0 | 0 | 0 | G | G | 1 | No Deposit | 241 | Booking not done by company | 0 | Transient | 154.77 | 0 | 1 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 37 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 241 | Booking not done by company | 0 | Transient | 94.71 | 0 | 0 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 68 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | IRL | Online TA | TA/TO | 0 | 0 | 0 | D | E | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 97.00 | 0 | 3 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 37 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 8 | Booking not done by company | 0 | Contract | 97.50 | 0 | 0 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 12 | 2015 | July | 27 | 1 | 0 | 1 | 2 | 0 | 0 | BB | IRL | Online TA | TA/TO | 0 | 0 | 0 | A | E | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 88.20 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 1 | 2 | 0 | 0 | BB | FRA | Corporate | Corporate | 0 | 0 | 0 | A | G | 0 | No Deposit | Booking not done by agent | 110 | 0 | Transient | 107.42 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 4 | 2 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | G | G | 0 | No Deposit | 250 | Booking not done by company | 0 | Transient | 153.00 | 0 | 1 | Check-Out | 2015-07-05 |
Resort Hotel | 0 | 37 | 2015 | July | 27 | 1 | 1 | 4 | 1 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | F | F | 0 | No Deposit | 241 | Booking not done by company | 0 | Transient | 97.29 | 0 | 1 | Check-Out | 2015-07-06 |
Resort Hotel | 0 | 72 | 2015 | July | 27 | 1 | 2 | 4 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | A | A | 1 | No Deposit | 250 | Booking not done by company | 0 | Transient | 84.67 | 0 | 1 | Check-Out | 2015-07-07 |
Resort Hotel | 0 | 72 | 2015 | July | 27 | 1 | 2 | 4 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | D | D | 1 | No Deposit | 250 | Booking not done by company | 0 | Transient | 99.67 | 0 | 1 | Check-Out | 2015-07-07 |
Resort Hotel | 0 | 127 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | HB | GBR | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | I | 0 | No Deposit | 115 | Booking not done by company | 0 | Contract | 94.95 | 0 | 1 | Check-Out | 2015-07-01 |
Resort Hotel | 0 | 78 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | BB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 5 | Booking not done by company | 0 | Transient | 63.60 | 1 | 0 | Check-Out | 2015-07-08 |
Resort Hotel | 0 | 48 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | BB | IRL | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 8 | Booking not done by company | 0 | Contract | 79.50 | 0 | 0 | Check-Out | 2015-07-08 |
Resort Hotel | 1 | 60 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 107.00 | 0 | 2 | Canceled | 2015-05-11 |
Resort Hotel | 0 | 77 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 94.00 | 0 | 0 | Check-Out | 2015-07-08 |
Resort Hotel | 0 | 99 | 2015 | July | 27 | 1 | 2 | 5 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 240 | Booking not done by company | 0 | Transient | 87.30 | 1 | 1 | Check-Out | 2015-07-08 |
Resort Hotel | 0 | 118 | 2015 | July | 27 | 1 | 4 | 10 | 1 | 0 | 0 | BB | NULL | Direct | Direct | 0 | 0 | 0 | A | A | 2 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 62.00 | 0 | 2 | Check-Out | 2015-07-15 |
Resort Hotel | 0 | 95 | 2015 | July | 27 | 1 | 4 | 11 | 2 | 0 | 0 | BB | GBR | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 241 | Booking not done by company | 0 | Transient | 63.86 | 0 | 0 | Check-Out | 2015-07-16 |
Resort Hotel | 1 | 96 | 2015 | July | 27 | 1 | 2 | 8 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | E | E | 0 | No Deposit | Booking not done by agent | Booking not done by company | 0 | Transient | 108.30 | 0 | 2 | Canceled | 2015-05-29 |
Resort Hotel | 0 | 69 | 2015 | July | 27 | 2 | 2 | 4 | 2 | 0 | 0 | BB | IRL | Offline TA/TO | TA/TO | 0 | 0 | 0 | A | C | 0 | No Deposit | 175 | Booking not done by company | 0 | Transient | 65.50 | 0 | 0 | Check-Out | 2015-07-08 |
A detailed description of the extracted variables, their origin, and the engineering procedures employed in its creation is shown in the below table. The names of the tables can be referred from Diagram.1 in Data Description tab.
options(warn = -1)
#Preparing a table to show information regarding each variable in the data set
Variable_Name <- c("ADR","Adults","Agent","ArrivalDateDayOfMonth","ArrivalDateMonth","ArrivalDateWeekNumber","ArrivalDateYear","AssignedRoomType","Babies","BookingChanges","Children","Company","Country","CustomerType","DaysInWaitingList","DepositType","DistributionChannel","IsCanceled","IsRepeatedGuest","LeadTime","MarketSegment","Meal","PreviousBookingsNotCanceled","PreviousCancellations","RequiredCardParkingSpaces","ReservationStatus","ReservationStatusDate","ReservedRoomType","StaysInWeekendNights","StaysInWeekNights","TotalOfSpecialRequests")
Variable_Type <- c("Numeric","Integer","Categorical","Integer","Categorical","Integer","Integer","Categorical","Integer","Integer","Integer","Categorical","Categorical","Categorical","Integer","Categorical","Categorical","Categorical","Categorical","Integer","Categorical","Categorical","Integer","Integer","Integer","Categorical","Date","Categorical","Integer","Integer","Integer")
Variable_Description <- c("Average Daily Rate as defined",
"Number of adults",
"ID of the travel agency that made the bookinga",
"Day of the month of the arrival date",
"Month of arrival date with 12 categories: “January” to “December”",
"Week number of the arrival date",
"Year of arrival date",
"Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
"Number of babies",
"Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
"Number of children",
"ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
"Country of origin. Categories are represented in the ISO 3155–3:2013 format",
"Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
"Number of days the booking was in the waiting list before it was confirmed to the customer",
"Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
"Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Value indicating if the booking was canceled (1) or not (0)",
"Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
"Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)",
"Number of previous bookings not cancelled by the customer prior to the current booking",
"Number of previous bookings that were cancelled by the customer prior to the current booking",
"Number of car parking spaces required by the customer",
"Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
"Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel",
"Code of room type reserved. Code is presented instead of designation for anonymity reasons",
"Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
"Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
"Number of special requests made by the customer (e.g. twin bed or high floor)")
Variable_Source <- c("BO, BL and TR / Calculated by dividing the sum of all lodging transactions by the total number of staying nights",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL",
"BO and BL/Calculated by adding the number of unique iterations that change some of the booking attributes, namely: persons, arrival date, nights, reserved room type or meal",
"BO and BL/Sum of both payable and non-payable children",
"BO and BL.",
"BO, BL and NT",
"BO and BL",
"BO/Calculated by subtracting the date the booking was confirmed to the customer from the date the booking entered on the PMS",
"BO and TR/Value calculated based on the payments identified for the booking in the transaction (TR) table before the booking׳s arrival or cancellation date. In case no payments were found the value is “No Deposit”. If the payment was equal or exceeded the total cost of stay, the value is set as “Non Refund”. Otherwise the value is set as “Refundable”",
"BO, BL and DC",
"BO",
"BO, BL and C/ Variable created by verifying if a profile was associated with the booking customer. If so, and if the customer profile creation date was prior to the creation date for the booking on the PMS database it was assumed the booking was from a repeated guest",
"BO and BL/ Subtraction of the entering date from the arrival date",
"BO, BL and MS",
"BO, BL and ML",
"BO and BL / In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and not canceled.",
"BO and BL/ In case there was no customer profile associated with the booking, the value is set to 0. Otherwise, the value is the number of bookings with the same customer profile created before the current booking and canceled.",
"BO and BL",
"BO",
"BO",
"BO and BL",
"BO and BL/ Calculated by counting the number of weekend nights from the total number of nights",
"BO and BL/Calculated by counting the number of week nights from the total number of nights",
"BO and BL/Sum of all special requests")
Data_Description <- as_data_frame(cbind(Variable_Name, Variable_Type, Variable_Description, Variable_Source))
DT::datatable(head(Data_Description,35))
We will now do a summary analysis of each variable in the data set using describe().
#obtaining summary statistics
describe(hotel_data)
## vars n mean sd median trimmed mad
## hotel* 1 87229 1.39 0.49 1.0 1.36 0.00
## is_canceled 2 87229 0.28 0.45 0.0 0.22 0.00
## lead_time 3 87229 79.97 86.06 49.0 65.96 65.23
## arrival_date_year 4 87229 2016.21 0.69 2016.0 2016.26 1.48
## arrival_date_month* 5 87229 6.29 3.47 6.0 6.26 4.45
## arrival_date_week_number 6 87229 26.84 13.67 27.0 26.76 16.31
## arrival_date_day_of_month 7 87229 15.82 8.84 16.0 15.82 11.86
## stays_in_weekend_nights 8 87229 1.00 1.03 1.0 0.92 1.48
## stays_in_week_nights 9 87229 2.62 2.04 2.0 2.41 1.48
## adults 10 87229 1.88 0.62 2.0 1.89 0.00
## children 11 87229 0.14 0.46 0.0 0.00 0.00
## babies 12 87229 0.01 0.11 0.0 0.00 0.00
## meal* 13 87229 1.55 1.07 1.0 1.32 0.00
## country* 14 87229 88.16 45.11 77.0 88.92 66.72
## market_segment* 15 87229 6.08 1.34 7.0 6.31 0.00
## distribution_channel* 16 87229 3.53 0.95 4.0 3.73 0.00
## is_repeated_guest 17 87229 0.04 0.19 0.0 0.00 0.00
## previous_cancellations 18 87229 0.03 0.37 0.0 0.00 0.00
## previous_bookings_not_canceled 19 87229 0.18 1.73 0.0 0.00 0.00
## reserved_room_type* 20 87229 2.26 1.83 1.0 1.95 0.00
## assigned_room_type* 21 87229 2.66 1.97 1.0 2.39 0.00
## booking_changes 22 87229 0.27 0.71 0.0 0.10 0.00
## deposit_type* 23 87229 1.01 0.13 1.0 1.00 0.00
## agent* 24 87229 226.80 116.81 315.0 238.23 28.17
## company* 25 87229 337.58 51.21 349.0 349.00 0.00
## days_in_waiting_list 26 87229 0.75 10.00 0.0 0.00 0.00
## customer_type* 27 87229 3.06 0.53 3.0 3.04 0.00
## adr 28 87229 106.52 54.89 98.2 102.49 44.85
## required_car_parking_spaces 29 87229 0.08 0.28 0.0 0.00 0.00
## total_of_special_requests 30 87229 0.70 0.83 0.0 0.58 0.00
## reservation_status* 31 87229 1.75 0.46 2.0 1.80 0.00
## reservation_status_date 32 87229 NaN NA NA NaN NA
## min max range skew kurtosis se
## hotel* 1 2 1 0.45 -1.79 0.00
## is_canceled 0 1 1 1.01 -0.99 0.00
## lead_time 0 737 737 1.43 2.12 0.29
## arrival_date_year 2015 2017 2 -0.30 -0.88 0.00
## arrival_date_month* 1 12 11 0.02 -1.20 0.01
## arrival_date_week_number 1 53 52 0.02 -0.95 0.05
## arrival_date_day_of_month 1 31 30 0.00 -1.20 0.03
## stays_in_weekend_nights 0 19 19 1.34 7.07 0.00
## stays_in_week_nights 0 50 50 2.58 20.10 0.01
## adults 0 55 55 20.39 1391.01 0.00
## children 0 10 10 3.46 12.92 0.00
## babies 0 10 10 21.13 1202.71 0.00
## meal* 1 4 3 1.53 0.57 0.00
## country* 1 178 177 0.12 -1.46 0.15
## market_segment* 1 8 7 -1.27 0.44 0.00
## distribution_channel* 1 5 4 -1.63 1.02 0.00
## is_repeated_guest 0 1 1 4.79 20.98 0.00
## previous_cancellations 0 26 26 34.33 1725.48 0.00
## previous_bookings_not_canceled 0 72 72 20.46 578.92 0.01
## reserved_room_type* 1 9 8 1.09 -0.07 0.01
## assigned_room_type* 1 11 10 0.81 -0.23 0.01
## booking_changes 0 18 18 5.07 53.99 0.00
## deposit_type* 1 3 2 9.82 106.95 0.00
## agent* 1 334 333 -0.50 -1.49 0.40
## company* 1 349 348 -4.75 22.01 0.17
## days_in_waiting_list 0 391 391 19.47 483.77 0.03
## customer_type* 1 4 3 -1.39 6.59 0.00
## adr 0 5400 5400 11.02 992.18 0.19
## required_car_parking_spaces 0 8 8 3.49 21.71 0.00
## total_of_special_requests 0 5 5 1.08 0.82 0.00
## reservation_status* 1 3 2 -0.78 -0.46 0.00
## reservation_status_date Inf -Inf -Inf NA NA NA
Following are the problem statements which we are trying to solve with our analysis:
All the three problem statement is to help improve the revenue generation of the hotel business.
Let’s start with the analysis!
Uncovering Information:
Data analysis to find a solution to our problems can be done using plots and graphs which has been implemented below by slicing the data, creating new dataframes, splitting column values
We first would prepare data to find the busiest time for hotel booking.
We split the column reservation_status_date
into three new columns month
,year
, and day
. This will help us to do analysis on our first problem statement.
# Sub-setting the data with only reservation_status_date and reservation_status #
reservation_date <- hotel_data[,c(31,32)]
head(reservation_date)
## # A tibble: 6 x 2
## reservation_status reservation_status_date
## <chr> <date>
## 1 Check-Out 2015-07-01
## 2 Check-Out 2015-07-01
## 3 Check-Out 2015-07-02
## 4 Check-Out 2015-07-02
## 5 Check-Out 2015-07-03
## 6 Check-Out 2015-07-03
# Splitting the reservation_status_date column into 3 columns based on Year , Month & Day #
reservation_date <- separate(reservation_date,reservation_status_date,c("year", "month","day"), sep = "-")
head(reservation_date)
## # A tibble: 6 x 4
## reservation_status year month day
## <chr> <chr> <chr> <chr>
## 1 Check-Out 2015 07 01
## 2 Check-Out 2015 07 01
## 3 Check-Out 2015 07 02
## 4 Check-Out 2015 07 02
## 5 Check-Out 2015 07 03
## 6 Check-Out 2015 07 03
# Analyzing Count of Booking per Year #
yearwise_data <- sqldf("select count(reservation_status) as [No_of_bookings], year from reservation_date group by year")
#plotting the bar graph for bookings based on year
booking_year <- ggplot(yearwise_data, aes(x = year, y = No_of_bookings)) +
geom_bar(stat = "identity", fill = "lightblue") + ggtitle("No of Bookings vs Year")
booking_year
# Analyzing Count of Booking per Month #
monthwise_data <- sqldf("select count(reservation_status) as [No_of_bookings],month from reservation_date group by month")
#plotting the bar graph for bookings based on month
booking_month <- ggplot(monthwise_data, aes(x = month, y = No_of_bookings)) +
geom_bar(stat = "identity", fill = "lightpink") + ggtitle("# of bookings across months")
booking_month
From the above two graphs we find out that August has been the busiest month of the year with the highest number of booking and the year 2016 has seen the maximum number of bookings.
Now we will analyse the booking based on hotel type monthly and yearly.
#plotting graph for yearly bookings based on hotel type
ggplot(data = hotel_data,aes(arrival_date_year,fill = (hotel))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Years") +
ggtitle("Hotel Type vs # of bookings across years") +
labs(fill = 'Hotel Type')
We can see that for both the hotel type highest number of bookings has been in the year 2016.
#Rearranging the factors to arrange months
hotel_data <- hotel_data %>%
mutate(Months = fct_relevel(arrival_date_month,"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"))
#plotting the graph for monthly bookings based on hotel type
ggplot(data = hotel_data,aes(Months,fill = (hotel))) +
geom_bar(position = 'dodge') +
facet_grid(hotel_data$arrival_date_year) +
scale_y_continuous(name = "# of Bookings",labels = scales::comma) +
xlab("Months") +
ggtitle("Hotel Type vs # of bookings across months") +
labs(fill = 'Hotel Type')
For City Hotel in the year 2015 bookings have been maximum in September, 2016 in August and 2017 in May.
For Resort Hotel in the year bookings has been maximum in August for all the three years.
Now we proceed with finding the whether the bookings have been more for weekends or weekdays.
# Sub-setting the data with only reservation_status_date, stays_in_weekend_nights and stays_in_week_nights #
reservation_day <- hotel_data[,c(8,9,32)]
head(reservation_day)
## # A tibble: 6 x 3
## stays_in_weekend_nights stays_in_week_nights reservation_status_date
## <dbl> <dbl> <date>
## 1 0 0 2015-07-01
## 2 0 0 2015-07-01
## 3 0 1 2015-07-02
## 4 0 1 2015-07-02
## 5 0 2 2015-07-03
## 6 0 2 2015-07-03
# Splitting the reservation_status_date column into 3 columns based on Year , Month & Day #
reservation_day <- separate(reservation_day,reservation_status_date,c("year", "month","day"), sep = "-")
head(reservation_day)
## # A tibble: 6 x 5
## stays_in_weekend_nights stays_in_week_nights year month day
## <dbl> <dbl> <chr> <chr> <chr>
## 1 0 0 2015 07 01
## 2 0 0 2015 07 01
## 3 0 1 2015 07 02
## 4 0 1 2015 07 02
## 5 0 2 2015 07 03
## 6 0 2 2015 07 03
#preparing table to analyze the difference between bookings for weekend nights and weekday nights #
weekwise_date <- sqldf("select sum(stays_in_weekend_nights) as No_of_weekend_nights , sum(stays_in_week_nights) as No_of_weekday_nights,month
from reservation_day group by month")
#plotting line chart for booking based on weekends and weekdays
colors <- c("No_of_weekend_nights" = "darkred", "No_of_weekday_nights" = "orange")
ggplot(weekwise_date, aes(x = month, group = 1)) +
geom_line(aes(y = No_of_weekend_nights, color = "No_of_weekend_nights")) +
geom_line(aes(y = No_of_weekday_nights, color = "No_of_weekday_nights")) + labs(y = "No of nights stayed", x = "Month", color = "WeekDay/Weekend Night") + scale_color_manual(values = colors) + ggtitle("# weekday/weekend nights stayed across Months")
#calculating the weekly summary
week_summary <- hotel_data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(avg.weekend.stay = round(sum(stays_in_weekend_nights)/2),
avg.weekday.stay = round(sum(stays_in_week_nights)/5))
## `summarise()` has grouped output by 'hotel'. You can override using the `.groups` argument.
paged_table(week_summary)
We know that majority of holidays call for hotels and we expect a high volume of booking on Weekends as compared to Weekdays.
But from the above graph and summary table we see that number of bookings during the weekday is higher than the weekends, one of the reasons can be the corporate meetings and get together happen on Weekdays.
Now lets start with our second problem statement, to find the customer type helping in generating more revenue for the hotel.
# Total Number of cancellations for existing customers and new customers
hotel_data %>%
group_by(is_repeated_guest,is_canceled) %>%
summarise(length(is_canceled))
## `summarise()` has grouped output by 'is_repeated_guest'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: is_repeated_guest [2]
## is_repeated_guest is_canceled `length(is_canceled)`
## <dbl> <dbl> <int>
## 1 0 0 60117
## 2 0 1 23749
## 3 1 0 3103
## 4 1 1 260
From above, it is clear that number of bookings cancelled for new customers is 23745 while that of existing customers is 260. Thus, it is evident that the existing customers tend to trust the hotels and thus add more revenue to the hotel.
Now lets analyze our final problem statement - factors leading to cancellation.
ggplot(data = hotel_data) +
geom_bar(mapping = aes(x = is_canceled, fill = hotel)) +
scale_fill_manual(values = c("#ffd700", "steelblue")) +
labs(title = "Cancellation rate by hotel") +
xlab("Is Canceled") +
ylab("Count")
From the above graph we can conclude that the cancellation rate of city hotel is higher than the resort hotel.
# finding the effect of Lead Time on Booking Cancellation
hotel_data %>%
group_by(is_canceled) %>%
summarise(mean = mean(lead_time))
## # A tibble: 2 x 2
## is_canceled mean
## <dbl> <dbl>
## 1 0 70.2
## 2 1 106.
From above result, we can see that when average lead time is 106, customers tend to cancel their bookings. However, bookings are not cancelled if average lead time is 70.
#finding the impact of Market Segment on Cancelled Bookings
hotel_data %>%
ggplot(aes(is_canceled, fill = hotel_data$market_segment)) +
geom_bar(position = "fill") +
labs(title = "Behavior of Booking Status",
subtitle = hotel_data$market_segment,
fill="Market Segment",
x = "Booking Status",
y = "Percentage of Cancellation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "blue"))
#finding the impact of Distribution Channel on Cancelled Bookings
hotel_data %>%
ggplot(aes(is_canceled, fill = hotel_data$distribution_channel)) +
geom_bar(position = "fill") +
labs(title = "Behavior of Booking Status",
subtitle = hotel_data$distribution_channel,
fill="Distribution Channel",
x = "Booking Status",
y = "Percentage of Cancellation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "red"))
From the above two graphs we can see that in market segment - Online TA and in distribution channel - TA/TO contribute more towards booking cancellation.
#finding the impact of previous Cancellation on Booking Cancellation
hotel_data %>%
group_by(is_canceled) %>%
summarise(mean = mean(previous_cancellations))
## # A tibble: 2 x 2
## is_canceled mean
## <dbl> <dbl>
## 1 0 0.0187
## 2 1 0.0613
From the above summary we can see people who have done cancellation previously are tend to cancel their current bookings as well.
#finding the impact of Wait Period on Booking Cancellation
hotel_data %>%
group_by(is_canceled) %>%
summarise(mean(days_in_waiting_list))
## # A tibble: 2 x 2
## is_canceled `mean(days_in_waiting_list)`
## <dbl> <dbl>
## 1 0 0.717
## 2 1 0.823
From above, we can see more cancellations are done when average wait period is more than 82 days.
To find the revenue generation of each hotel, we have created a variable to store revenue earned from each booking.
#adding a column showing revenue per booking
hotel_data <- hotel_data %>%
mutate(Revenue_Per_Booking = adr * (stays_in_weekend_nights + stays_in_week_nights))
dim(hotel_data)
## [1] 87229 34
Now we have 34 variables in our data set.
Let’s see the summary of the new variable added.
#Summary Statistics of Revenue_Per_Booking
summary(hotel_data$Revenue_Per_Booking)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 153.0 299.2 395.0 504.0 7590.0
The above summary indicates that the average revenue generated per booking is $395.
We are incorporating Linear Modeling to understand the impact of the other variables on Revenue.
To build the model , we would first prepare the data week wise. We would like to include Week number of arrival, hotel type, customer type, meal type, is_repeated_guest, total number of bookings, total revenue for the week, average adr and average stay nights.
#finding the total number of nights the guest stayed including weekday nights and weekend nights
hotel_data$total_stay_nights <- rowSums(cbind(hotel_data$stays_in_weekend_nights, hotel_data$stays_in_week_nights), na.rm = TRUE)
#finding the total revenue earned by each booking by multiplying total number of nights and average daily price
hotel_data <- hotel_data %>% mutate(revenue = (total_stay_nights * adr))
#preparing the model
data_model <- hotel_data %>% filter(is_canceled == 0) %>% group_by(arrival_date_week_number, hotel, customer_type, meal, is_repeated_guest) %>% summarise(total_bookings = n(), total_guests = sum(adults + children + babies), average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
## `summarise()` has grouped output by 'arrival_date_week_number', 'hotel', 'customer_type', 'meal'. You can override using the `.groups` argument.
model_1 <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_model)
summary(model_1)
##
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) +
## as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights +
## week_adr, data = data_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52144 -19784 -3797 8512 238367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6809.60 3559.52 1.913 0.0560
## as.factor(hotel)Resort Hotel -750.30 2092.62 -0.359 0.7200
## as.factor(customer_type)Group -3445.48 3543.84 -0.972 0.3311
## as.factor(customer_type)Transient 39166.66 2980.50 13.141 <2e-16
## as.factor(customer_type)Transient-Party 7900.53 3066.95 2.576 0.0101
## as.factor(meal)FB -54489.48 4404.66 -12.371 <2e-16
## as.factor(meal)HB -30587.91 2409.70 -12.694 <2e-16
## as.factor(meal)SC -33358.52 2472.32 -13.493 <2e-16
## as.factor(is_repeated_guest)1 -30760.10 2220.00 -13.856 <2e-16
## average_stay_nights 451.32 351.29 1.285 0.1991
## week_adr 196.51 20.76 9.467 <2e-16
##
## (Intercept) .
## as.factor(hotel)Resort Hotel
## as.factor(customer_type)Group
## as.factor(customer_type)Transient ***
## as.factor(customer_type)Transient-Party *
## as.factor(meal)FB ***
## as.factor(meal)HB ***
## as.factor(meal)SC ***
## as.factor(is_repeated_guest)1 ***
## average_stay_nights
## week_adr ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33530 on 1334 degrees of freedom
## Multiple R-squared: 0.3675, Adjusted R-squared: 0.3628
## F-statistic: 77.52 on 10 and 1334 DF, p-value: < 2.2e-16
Results
Based on our exploratory data analysis, below are some of the insights we gained:
August has been the busiest month of the year with the highest number of booking and the year 2016 has seen the maximum number of bookings.
The number of bookings during the weekday is higher than the weekends, one of the reasons can be the corporate meetings and get together happen on Weekdays.
The existing customers tend to trust the hotels and thus add more revenue to the hotel.
The cancellation rate of city hotel is higher than the resort hotel.
When average lead time is more, customers tend to cancel their bookings.
In market segment - Online TA and in distribution channel - TA/TO contribute more towards booking cancellation.
People who have done more number of booking cancellation previously are tend to cancel their current bookings as well.
More cancellations are done when average wait period is more than 82 days.
The average revenue generated per booking is $395.
Based on our findings we can suggest that to increase the revenue hotel industry can pay more attention to factors such as average wait time, lead time, customer type, booking days, etc.