Authors: Fnu Jagriti | Sai Seetha Ram Anne | Deepthi Rajagopal
The future of the hotel and hospitality industries is a hot topic, and the paths to success are potentially many. The hotel industry has been hit hard due to the corona virus pandemic which left a grave impact on society. Consumer spending behavior has changed and will continue to change, stipulating business plans to adapt to these new circumstances. Therefore, it is very important for hotel industries to gain a leg up on their competitors by evolving their business strategy to acquire new customers and retain existing customers. This is possible by analyzing the real-time consumer data to understand customer behavior, finding the gaps and trends in their booking history, and tailoring the business strategy accordingly.
Our data set contains hotel demand data of two hotels in Portugal: Resort Hotel of Algarve and City Hotel of Lisbon. This data set has 31 variables describing 40,060 observations of Resort hotel and 79,330 observations of City Hotel i.e. we have almost double records for City Hotel than Resort Hotel.
The primary purpose of our analysis is to provide insights for effective revenue management by predicting the number of booking cancellations for each hotel type over time based on the current trend and by implementing plans to minimize such cancellations.
Our analysis aims to resolve two problem statements which aims at answering some burning questions:
To determine the number of bookings that are cancelled for each hotel type. It is important to minimize this as hotels can’t afford to lose their existing customers as it is always easier to retain the existing ones than to acquire new ones.
To explore the reason for their cancellation so that it can be avoided from the current customers.
To find the correlation among other variables and drop columns which are highly correlated and do not contribute to any significant information.
To develop a prediction model to determine the likelihood of booking cancellation so that it can be avoided by putting appropriate action plan in place.
To create new variable that determines the average revenue generation for each hotel. It helps in determining which hotel is most preferred and what needs to done to hotel to maintain a competitive edge.
To find the seasonality trend of revenue generation so that hotels can do strategic plannings to make up for the loss in a month/year in next ones.
Our approach is to determine the number of cancellations done in both the hotels and to find the dependency of the variables on other variables present in the data set. This helps in determining the factors which can be controlled to minimize the booking cancellations. Further,once we have determined the variables, we can apply our mode - decision tree or random forest to predict the booking cancellations by training our model with the current data. Additionally, we will also try to find out the revenue generation of each hotel and the factors which drive it, so that we can improve factors to minimize booking cancellations to maximize the revenue generation.
We will start with the data pre-processing to convert our raw data into appropriate formats and remove any duplicate values and insignificant data. We will then clean the data by looking for NA’s or missing values and then imputing them or dropping them as required. We will then do exploratory data analysis such as checking summary statistics, converting data type of variables, finding skewness of each variables to determine the outliers etc. We will use different visualizations to gauge different metrics associated with the booking cancellations and revenue generation. Finally, we will create our model to predict the booking cancellations.
The key insights derived from this analysis will help the hotels in determining the bookings which are likely to be cancelled and they can provide discounts and offers based on the consumer behavior to avoid such cancellations. Also, the analysis will also help in determining the factors which drives revenue generation and how the factors undermining it can be improved upon to maximize their earnings.
To reproduce the codes and results throughout this project without any errors, we need following packages:
library(tidyverse) #Data Manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tibble) #Creating Tibbles
library(tidyr) #Creating Tidy Data
library(dplyr) #Data Analysis
library(DT) #Display Data set
library(ggplot2) #Data Visualization
library(magrittr) #Piping Operator %>%
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(knitr) #To Display tables
The dataset we are using for our analysis comes from an open hotel booking demand Dataset present at Github repository.
Our dataset is a combination of demand data of two hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). The dataset has 32 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. It shows bookings done between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.
Data Dictionary gives a description of each variable i.e. it tells what does each of the columns in the dataset represent. It is very helpful in understanding our data better and recognizing any anomalies in dataset.
options(warn = -1)
Variable_Name <- c("hotel","is_canceled","lead_time","arrival_date_year","arrival_date_month","arrival_date_week_number","arrival_date_day_of_month","stays_in_weekend_nights","stays_in_week_nights","adults","children","babies","meal","country","market_segment","distribution_channel","is_repeated_guest","previous_cancellations","previous_bookings_not_canceled","reserved_room_type","assigned_room_type","booking_changes","deposit_type","agent","company","days_in_waiting_list","customer_type","adr","required_car_parking_spaces","total_of_special_requests","reservation_status","reservation_status_date")
Variable_Description <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)","Value indicating if the booking was canceled (1) or not (0)","Number of days that elapsed between the entering date of the booking into the PMS and the arrival date","Year of arrival date","Month of arrival date","Week number of year for arrival date","Day of arrival date","Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel","Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel","Number of adults","Number of children","Number of babies","Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)","Country of origin. Categories are represented in the ISO 3155–3:2013 format","Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”","Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”","Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of previous bookings that were cancelled by the customer prior to the current booking","Number of previous bookings not cancelled by the customer prior to the current booking","Code of room type reserved. Code is presented instead of designation for anonymity reasons","Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons","Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation","Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.","ID of the travel agency that made the booking", "ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons","Number of days the booking was in the waiting list before it was confirmed to the customer","Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking","Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights","Number of car parking spaces required by the customer","Number of special requests made by the customer (e.g. twin bed or high floor)","Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why","Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel")
Data_Dictionary <- as_data_frame(cbind(Variable_Name, Variable_Description))
Data_Dictionary
## # A tibble: 32 x 2
## Variable_Name Variable_Description
## <chr> <chr>
## 1 hotel Hotel (H1 = Resort Hotel or H2 = City Hotel)
## 2 is_canceled Value indicating if the booking was canceled (1) o~
## 3 lead_time Number of days that elapsed between the entering d~
## 4 arrival_date_year Year of arrival date
## 5 arrival_date_month Month of arrival date
## 6 arrival_date_week_number Week number of year for arrival date
## 7 arrival_date_day_of_month Day of arrival date
## 8 stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the ~
## 9 stays_in_week_nights Number of week nights (Monday to Friday) the guest~
## 10 adults Number of adults
## # ... with 22 more rows
We have read our csv file and found that there are 119390 rows and 32 columns in the dataset.
Hotels_Data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Missing Values Check:
#To check for missing values in each column
colSums(is.na(Hotels_Data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
From above, we can see we have 4 missing values for children column and no missing values for rest of the variables.
Removing NA or Missing Value Records:
Hotels_Data <- na.omit(Hotels_Data)
Hotels_Data
## # A tibble: 119,386 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ... with 119,376 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
Since, we have just 4 missing values in children column, we have removed those records as it will not hinder our further analysis. Now, we have 119386 records present in our dataset.
Removing Duplicate Records:
Sometimes, we have same records present in the dataset multiple times. This causes sampling bias and hence, we remove any duplicate records present in the dataset.
Hotels_Data <- unique(Hotels_Data)
The total unique observations has become 87392 which shows that approximately 27% of our records were duplicate which were dropped off.
Checking for Correctness of Data in Each Column
From the data dictionary, we can see column meal has two values SC and Undefined which basically represent the same thing i.e.no meal package. Thus, we replace “Undefined” with “SC”.
# Replacing the two values representing the same thing by a single value
Hotels_Data$meal <- replace(Hotels_Data$meal,Hotels_Data$meal == 'Undefined','SC')
# Structure of Dataset
str(Hotels_Data)
## tibble [87,392 x 32] (S3: tbl_df/tbl/data.frame)
## $ hotel : chr [1:87392] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:87392] 0 0 0 0 0 0 0 1 1 1 ...
## $ lead_time : num [1:87392] 342 737 7 13 14 0 9 85 75 23 ...
## $ arrival_date_year : num [1:87392] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:87392] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:87392] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:87392] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:87392] 0 0 1 1 2 2 2 3 3 4 ...
## $ adults : num [1:87392] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:87392] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:87392] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:87392] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:87392] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:87392] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:87392] "C" "C" "C" "A" ...
## $ booking_changes : num [1:87392] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:87392] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:87392] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:87392] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:87392] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:87392] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:87392] 0 0 0 0 1 0 1 1 0 0 ...
## $ reservation_status : chr [1:87392] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:87392], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
## ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"
From the structure of our dataset, we can see a lot of categorical data is of character data type and do not provide much information in this state. Summary() function handles factors differently than numbers (and strings), the occurrence counts for each value is often more useful information. Therefore, our next step is to convert categorical and binary variables into factors and find the summary statistics.
# Conversion of Categorical and Binary Variables to Factors
Hotels_Data <- Hotels_Data %>%
mutate(hotel = as.factor(hotel),
is_canceled = as.factor(is_canceled),
arrival_date_day_of_month = as.factor(arrival_date_day_of_month),
arrival_date_month = as.factor(arrival_date_month),
arrival_date_year = as.factor(arrival_date_year),
meal = as.factor(meal),
country = as.factor(country),
market_segment = as.factor(market_segment),
distribution_channel = as.factor(distribution_channel),
is_repeated_guest = as.factor(is_repeated_guest),
reserved_room_type = as.factor(reserved_room_type),
assigned_room_type = as.factor(assigned_room_type),
deposit_type = as.factor(deposit_type),
customer_type = as.factor(customer_type),
reservation_status = as.factor(reservation_status),
agent = as.factor(agent),
company = as.factor(company)
)
# Finding glimpse of Hotels Dataset after conversion of variables to factors
glimpse(Hotels_Data)
## Rows: 87,392
## Columns: 32
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled <fct> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, ~
## $ lead_time <dbl> 342, 737, 7, 13, 14, 0, 9, 85, 75, 23, ~
## $ arrival_date_year <fct> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, ~
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <fct> BB, BB, BB, BB, BB, BB, FB, BB, HB, BB,~
## $ country <fct> PRT, PRT, GBR, GBR, GBR, PRT, PRT, PRT,~
## $ market_segment <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <fct> C, C, A, A, A, C, C, A, D, E, D, D, G, ~
## $ assigned_room_type <fct> C, C, C, A, A, C, C, A, D, E, D, E, G, ~
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent <fct> NULL, NULL, NULL, 304, 240, NULL, 303, ~
## $ company <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <fct> Transient, Transient, Transient, Transi~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 107.00~
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 3, 1, ~
## $ reservation_status <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~
# Summary Statistics
summary(Hotels_Data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :53424 0:63371 Min. : 0.00 2015:13309
## Resort Hotel:33968 1:24021 1st Qu.: 11.00 2016:42391
## Median : 49.00 2017:31692
## Mean : 79.89
## 3rd Qu.:125.00
## Max. :737.00
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :11253 Min. : 1.00 17 : 3020
## July :10057 1st Qu.:16.00 2 : 3016
## May : 8355 Median :27.00 26 : 3000
## April : 7908 Mean :26.84 5 : 2978
## June : 7765 3rd Qu.:37.00 16 : 2959
## March : 7513 Max. :53.00 19 : 2949
## (Other):34541 (Other):69470
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 2.000 Median : 2.000
## Mean : 1.005 Mean : 2.625 Mean : 1.876
## 3rd Qu.: 2.000 3rd Qu.: 4.000 3rd Qu.: 2.000
## Max. :19.000 Max. :50.000 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.00000 BB:67974 PRT :27449
## 1st Qu.: 0.0000 1st Qu.: 0.00000 FB: 360 GBR :10433
## Median : 0.0000 Median : 0.00000 HB: 9085 FRA : 8837
## Mean : 0.1386 Mean : 0.01082 SC: 9973 ESP : 7252
## 3rd Qu.: 0.0000 3rd Qu.: 0.00000 DEU : 5387
## Max. :10.0000 Max. :10.00000 ITA : 3066
## (Other):24968
## market_segment distribution_channel is_repeated_guest
## Aviation : 227 Corporate: 5081 0:83977
## Complementary: 702 Direct :12988 1: 3415
## Corporate : 4212 GDS : 181
## Direct :11803 TA/TO :69141
## Groups : 4942 Undefined: 1
## Offline TA/TO:13889
## Online TA :51617
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.000 A :56552
## 1st Qu.: 0.00000 1st Qu.: 0.000 D :17398
## Median : 0.00000 Median : 0.000 E : 6049
## Mean : 0.03042 Mean : 0.184 F : 2823
## 3rd Qu.: 0.00000 3rd Qu.: 0.000 G : 2052
## Max. :26.00000 Max. :72.000 B : 995
## (Other): 1523
## assigned_room_type booking_changes deposit_type agent
## A :46313 Min. : 0.0000 No Deposit:86247 9 :28758
## D :22432 1st Qu.: 0.0000 Non Refund: 1038 240 :13028
## E : 7195 Median : 0.0000 Refundable: 107 NULL :12191
## F : 3627 Mean : 0.2716 14 : 3348
## G : 2498 3rd Qu.: 0.0000 7 : 3300
## C : 2165 Max. :21.0000 250 : 2779
## (Other): 3162 (Other):23988
## company days_in_waiting_list customer_type adr
## NULL :82133 Min. : 0.0000 Contract : 3139 Min. : -6.38
## 40 : 851 1st Qu.: 0.0000 Group : 544 1st Qu.: 72.00
## 223 : 503 Median : 0.0000 Transient :71986 Median : 98.10
## 45 : 238 Mean : 0.7496 Transient-Party:11723 Mean : 106.34
## 153 : 206 3rd Qu.: 0.0000 3rd Qu.: 134.00
## 154 : 133 Max. :391.0000 Max. :5400.00
## (Other): 3328
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Canceled :23007
## 1st Qu.:0.00000 1st Qu.:0.0000 Check-Out:63371
## Median :0.00000 Median :0.0000 No-Show : 1014
## Mean :0.08423 Mean :0.6985
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-03-18
## Median :2016-09-08
## Mean :2016-08-31
## 3rd Qu.:2017-03-05
## Max. :2017-09-14
##
The summary statistics above shows much more meaningful information now. The columns which has been converted to factors are now showing count of observations for each level.
It also indicates that the adr column has some negative values as well which is not possible which shows data is incorrect and needs to be further investigated.
Some of the bookings has adults column as 0 which is legally not possible. There cannot be any booking made just for children and babies when they are not accompanied by any adults. This shows that there is some error in this data.
# Checking Different values of Meal column
levels(Hotels_Data$meal)
## [1] "BB" "FB" "HB" "SC"
From above, we can see that the meals column has been corrected to have only four values.
# Finding negative value in adr column
nrow(subset(Hotels_Data, Hotels_Data$adr < 0))
## [1] 1
We can see that there is only 1 row having negative adr which makes no sense as if a room is booked, it will have some positive or zero (in case of No Deposit) average daily rate. We can drop off these records for our further analysis.
# Removing adr columns with zero or negative values
Hotels_Data <- Hotels_Data %>% filter(Hotels_Data$adr > 0 | Hotels_Data$adr == 0)
So, we have truncated our dataset to 87391 rows by removing observations having negative or zero adr.
# Finding number of observations when no adult checked in
nrow(subset(Hotels_Data, Hotels_Data$adults == 0))
## [1] 385
# Finding number of observations when no adults, children and babies checked in
nrow(subset(Hotels_Data, Hotels_Data$babies == 0 & Hotels_Data$adults == 0 & Hotels_Data$children == 0))
## [1] 166
# Finding number of observations when only babies checked in
nrow(subset(Hotels_Data, Hotels_Data$babies != 0 & Hotels_Data$adults == 0 & Hotels_Data$children == 0))
## [1] 0
In the above result, we have 385 records when no adults checked-in. It might seem impractical at first go but we cannot consider it as incorrect data. For example, a person who is 17 years old can book a hotel but will be categorized under children column.
However, we can see there are 166 observations where no adult, babies or children have checked-in. This doesn’t seem right and can be easily dropped off from the rest of the dataset.
# Removing observations having no adults, children or babies checked-in
Hotels_Data <- Hotels_Data %>% filter(Hotels_Data$adults != 0 | Hotels_Data$adults != 0 | Hotels_Data$children != 0)
Our dataset is now trimmed down to 87225 observations.
Outlier Detection
Another very important aspect of data cleaning is to find out the outliers present in the data. These outliers may be due to incorrect data or due to the extreme values. We are going to find out the outliers of each numeric column to determine whether to keep or remove those outliers.
We have used boxplots to determine the outliers of each numeric column.
# Creating boxplot of all numeric variables
boxplot(Hotels_Data$lead_time, ylab = "lead_time",main = "Boxplot of lead_time")
boxplot(Hotels_Data$arrival_date_week_number, ylab = "arrival_date_week_number",main = "Boxplot of arrival_date_week_number")
boxplot(Hotels_Data$stays_in_weekend_nights, ylab = "stays_in_weekend_nights",main = "Boxplot of stays_in_weekend_nights")
boxplot(Hotels_Data$stays_in_week_nights, ylab = "stays_in_week_nights",main = "Boxplot of stays_in_week_nights")
boxplot(Hotels_Data$adults, ylab = "adults",main = "Boxplot of adults")
boxplot(Hotels_Data$children, ylab = "children",main = "Boxplot of children")
boxplot(Hotels_Data$babies, ylab = "babies",main = "Boxplot of babies")
boxplot(Hotels_Data$previous_cancellations, ylab = "previous_cancellations",main = "Boxplot of previous_cancellations")
boxplot(Hotels_Data$previous_bookings_not_canceled, ylab = "previous_bookings_not_canceled",main = "Boxplot of previous_bookings_not_canceled")
boxplot(Hotels_Data$booking_changes, ylab = "booking_changes",main = "Boxplot of booking_changes")
boxplot(Hotels_Data$days_in_waiting_list, ylab = "days_in_waiting_list",main = "Boxplot of days_in_waiting_list")
boxplot(Hotels_Data$adr, ylab = "adr",main = "Boxplot of Average_Daily_Rate")
boxplot(Hotels_Data$required_car_parking_spaces, ylab = "required_car_parking_spaces",main = "Boxplot of required_car_parking_spaces")
boxplot(Hotels_Data$total_of_special_requests, ylab = "total_of_special_requests",main = "Boxplot of total_of_special_requests")
From boxplots above, we can visualize that there are outliers present for all numeric columns except “arrival_date_week_number” column. At this point, these outliers seems to be something originating from extreme values and not due to error. Thus, we will keep them in our dataset.
After necessary cleaning, we have our final dataset ready to use for Exploratory Data Analysis. Let’s take a look at our final dataset.
DT::datatable(head(Hotels_Data,100))
We have two problem statements that we are looking to resolve. Both the problem statements are actually intertwined:
To explore how much booking cancellations are done for both the hotels,what are the factors that are contributing to booking cancellations, what kind of customers are doing cancellations etc.
To determine the revenue generated for both the hotels, What kind of customers are contributing more towards revenue generation, how booking cancellations is impacting revenue generation etc.
Let’s dive into it:
#To find number of booking cancellations for each hotel type
Hotels_Data %>%
group_by(hotel,is_canceled) %>%
summarise(length(is_canceled))
## `summarise()` has grouped output by 'hotel'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: hotel [2]
## hotel is_canceled `length(is_canceled)`
## <fct> <fct> <int>
## 1 City Hotel 0 37239
## 2 City Hotel 1 16031
## 3 Resort Hotel 0 25981
## 4 Resort Hotel 1 7974
From above result, we can see that booking cancellations of City Hotel is almost 30% and that of Resort Hotel is 23%. Thus, City Hotel has slightly more cancellations than Resort Hotel but this is not significant difference in terms of cancelled bookings and both the hotels should try to minimize their cancellations.
We can visualize the above result from a bar chart as well.
BookingStatus_bar <- function(behavior){
Hotels_Data %>%
ggplot(aes(is_canceled, fill = behavior)) +
geom_bar(position = "fill") +
labs(title = "Behavior of Booking Status",
subtitle = behavior,
x = "Booking Status (1 for Cancelled)",
y = "Percentage of Cancellation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "blue"))
}
BookingStatus_bar(Hotels_Data$hotel)
# Total Number of cancellations for both existing customers and new customers
Hotels_Data %>%
group_by(is_repeated_guest,is_canceled) %>%
summarise(length(is_canceled))
## `summarise()` has grouped output by 'is_repeated_guest'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: is_repeated_guest [2]
## is_repeated_guest is_canceled `length(is_canceled)`
## <fct> <fct> <int>
## 1 0 0 60117
## 2 0 1 23745
## 3 1 0 3103
## 4 1 1 260
From above, it is clear that number of bookings cancelled for new customers is 23745 (approximately 28%) while that of existing customers is 260 (approximately 7%). Thus, it is evident that the existing customers tend to trust the hotels but new customer struggles with that.
We can visualize the above result using a barchart below.Here 0 represents a New Customer and 1 represents a existing customer.
BookingStatus_bar(Hotels_Data$is_repeated_guest)
# Effect of Lead Time on Booking Cancellation
Hotels_Data %>%
group_by(is_canceled) %>%
summarise(mean = mean(lead_time))
## # A tibble: 2 x 2
## is_canceled mean
## <fct> <dbl>
## 1 0 70.2
## 2 1 106.
From above result, we can see that when average lead time is 105, customers tend to cancel their bookings.However, bookings continue to prevail if average lead time is 70. This indicates that hotels should try to minimize their lead time and try to keep it within less than 3 months.
# Booking Status based on Deposit Type
BookingStatus_bar(Hotels_Data$deposit_type)
From above bar graph, we can see that cancellations are done even if it is non-refundable. Thus, Deposit_Type is not our variable of concern.Infact, more cancellations are done on Non-Refundable type contrary to our popular belief.
table(Hotels_Data$deposit_type, Hotels_Data$is_canceled)
##
## 0 1
## No Deposit 63084 22996
## Non Refund 55 983
## Refundable 81 26
The statistics of booking status based on Deposit type is displayed above.
# Impact of Wait Period on Booking Cancellation
Hotels_Data %>%
group_by(is_canceled) %>%
summarise(mean(days_in_waiting_list))
## # A tibble: 2 x 2
## is_canceled `mean(days_in_waiting_list)`
## <fct> <dbl>
## 1 0 0.717
## 2 1 0.823
From above, we can see more cancellations are done when average wait period is more than 82 days.
# Impact of Market Segment on Cancelled Bookings
BookingStatus_bar(Hotels_Data$market_segment)
From above graph, it is quite evident, Online TA market segment is contributing towards more booking cancellations.
# Impact of Distribution Channel on Cancelled Bookings
BookingStatus_bar(Hotels_Data$distribution_channel)
From above graph, we can see more cancellations are done through TA/TO distribution channel.
#Impact of previous Cancellation on Booking Cancellation
Hotels_Data %>%
group_by(is_canceled) %>%
summarise(mean = mean(previous_cancellations))
## # A tibble: 2 x 2
## is_canceled mean
## <fct> <dbl>
## 1 0 0.0187
## 2 1 0.0613
As indicated above, those who have done cancellations previously tend to cancel more.
#Impact of previous boking changes on Booking Cancellation
Hotels_Data %>%
group_by(is_canceled) %>%
summarise(mean = mean(booking_changes))
## # A tibble: 2 x 2
## is_canceled mean
## <fct> <dbl>
## 1 0 0.309
## 2 1 0.161
Above tibble shows that those who do more booking changes tend to book hotels actually. But it does not give us much information. Thus, it is not a variable of our concern.
# Behavior of Customer Type on Booking Cancellation
BookingStatus_bar(Hotels_Data$customer_type)
Transient customer type tends to do more boking cancellations.
Hotels_Data %>%
group_by(is_canceled) %>%
summarise(mean(total_of_special_requests))
## # A tibble: 2 x 2
## is_canceled `mean(total_of_special_requests)`
## <fct> <dbl>
## 1 0 0.761
## 2 1 0.536
Above result does not give much information about the impact of special requests on booking cancellations. Thus, we can remove it from our further analysis.
Now, to deal with our second problem statement to find the revenue generation of each hotel, we have to first create a variable to store revenue earned from each booking.
# To add a column showing revenue per booking
Hotels_Data <- Hotels_Data %>%
mutate(Revenue_Per_Booking = adr * (stays_in_weekend_nights + stays_in_week_nights))
dim(Hotels_Data)
## [1] 87225 33
As shown above, our new dataset has one new column added to it i.e. Revenue_Per_Booking. Now, we have 33 variables.
# Summary Statistics of Revenue_Per_Booking
summary(Hotels_Data$Revenue_Per_Booking)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 153.0 299.2 395.0 504.0 7590.0
The above stat indicates that the average revenue generated per booking is 395$.Minimum value is 0 which might be the case where bookings are done for “No Deposit” type. Hotels should come up with some strategy to avoid this as it will result in revenue loss if customers book it with no deposit type but don’t show up.
Based on our exploratory data analysis, below are some of the insights we gained:
There are more cancellations in City Hotel (30%) than in Resort Hotel (23%) but this is not a significant difference and both the hotels need to convert these cancellations into successful bookings.
Lead time over 3 months lead to cancellation of bookings.
New Customers tend to do more cancellations than the existing customers. Hotels should come up with some discounts or professional offers to acquire new customers and provide good hospitality to gain their trust.
Having No Deposit Type doesn’t necessarily result in successful booking. Infact, Non-refundable type seems to have more cancellation than successful booking which is quite unusual.
If wait period is more than 82 days, cancellations are more frequent.
The customers who have done cancellations before tend to do more cancellation.
The customers who have done more booking changes tend to do successful booking.
There is more cancellation for market segment Online TA and distribution channel TA/TO.
The average revenue generated per booking is $395
What’s Next?
Our next step is to create a correlation matrix to find correlation among different variables to drop highly correlated variables from our analysis. Our problem is a classification problem. We will use linear regression to find dependency of booking cancellation on other variables to determine the best fitted line. We will also try to use some modeling technique to predict the booking cancellation.