Hotels_MidTerm-Project-Group-15-Final.knit

Hotels Booking Demand

Authors: Fnu Jagriti | Sai Seetha Ram Anne | Deepthi Rajagopal

Introduction

The future of the hotel and hospitality industries is a hot topic, and the paths to success are potentially many. The hotel industry has been hit hard due to the corona virus pandemic which left a grave impact on society. Consumer spending behavior has changed and will continue to change, stipulating business plans to adapt to these new circumstances. Therefore, it is very important for hotel industries to gain a leg up on their competitors by evolving their business strategy to acquire new customers and retain existing customers. This is possible by analyzing the real-time consumer data to understand customer behavior, finding the gaps and trends in their booking history, and tailoring the business strategy accordingly.

Problem Statement

Our data set contains hotel demand data of two hotels in Portugal: Resort Hotel of Algarve and City Hotel of Lisbon. This data set has 31 variables describing 40,060 observations of Resort hotel and 79,330 observations of City Hotel i.e. we have almost double records for City Hotel than Resort Hotel.

The primary purpose of our analysis is to provide insights for effective revenue management by predicting the number of booking cancellations for each hotel type over time based on the current trend and by implementing plans to minimize such cancellations.

Our analysis aims to resolve two problem statements which aims at answering some burning questions:

Problem Statement 1:

To determine the number of bookings that are cancelled for each hotel type. It is important to minimize this as hotels can’t afford to lose their existing customers as it is always easier to retain the existing ones than to acquire new ones.
To explore the reason for their cancellation so that it can be avoided from the current customers.
To find the correlation among other variables and drop columns which are highly correlated and do not contribute to any significant information.
To develop a prediction model to determine the likelihood of booking cancellation so that it can be avoided by putting appropriate action plan in place.

Problem Statement 2:

To create new variable that determines the average revenue generation for each hotel. It helps in determining which hotel is most preferred and what needs to done to hotel to maintain a competitive edge.
To find the seasonality trend of revenue generation so that hotels can do strategic plannings to make up for the loss in a month/year in next ones.

Methodology

Our approach is to determine the number of cancellations done in both the hotels and to find the dependency of the variables on other variables present in the data set. This helps in determining the factors which can be controlled to minimize the booking cancellations. Further,once we have determined the variables, we can apply our mode - decision tree or random forest to predict the booking cancellations by training our model with the current data. Additionally, we will also try to find out the revenue generation of each hotel and the factors which drive it, so that we can improve factors to minimize booking cancellations to maximize the revenue generation.

We will start with the data pre-processing to convert our raw data into appropriate formats and remove any duplicate values and insignificant data. We will then clean the data by looking for NA’s or missing values and then imputing them or dropping them as required. We will then do exploratory data analysis such as checking summary statistics, converting data type of variables, finding skewness of each variables to determine the outliers etc. We will use different visualizations to gauge different metrics associated with the booking cancellations and revenue generation. Finally, we will create our model to predict the booking cancellations.

Consumer Benefit

The key insights derived from this analysis will help the hotels in determining the bookings which are likely to be cancelled and they can provide discounts and offers based on the consumer behavior to avoid such cancellations. Also, the analysis will also help in determining the factors which drives revenue generation and how the factors undermining it can be improved upon to maximize their earnings.

Package Requirements

Packages Required

To reproduce the codes and results throughout this project without any errors, we need following packages:

library(tidyverse)    #Data Manipulation

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tibble)       #Creating Tibbles
library(tidyr)        #Creating Tidy Data
library(dplyr)        #Data Analysis
library(DT)           #Display Data set
library(ggplot2)      #Data Visualization
library(magrittr)     #Piping Operator %>%

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(knitr)        #To Display tables

Data Preparation

Data Source

The dataset we are using for our analysis comes from an open hotel booking demand Dataset present at Github repository.

Our dataset is a combination of demand data of two hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). The dataset has 32 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. It shows bookings done between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

Data Dictionary

Data Dictionary gives a description of each variable i.e. it tells what does each of the columns in the dataset represent. It is very helpful in understanding our data better and recognizing any anomalies in dataset.

options(warn = -1)

Variable_Name <- c("hotel","is_canceled","lead_time","arrival_date_year","arrival_date_month","arrival_date_week_number","arrival_date_day_of_month","stays_in_weekend_nights","stays_in_week_nights","adults","children","babies","meal","country","market_segment","distribution_channel","is_repeated_guest","previous_cancellations","previous_bookings_not_canceled","reserved_room_type","assigned_room_type","booking_changes","deposit_type","agent","company","days_in_waiting_list","customer_type","adr","required_car_parking_spaces","total_of_special_requests","reservation_status","reservation_status_date")

Variable_Description <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)","Value indicating if the booking was canceled (1) or not (0)","Number of days that elapsed between the entering date of the booking into the PMS and the arrival date","Year of arrival date","Month of arrival date","Week number of year for arrival date","Day of arrival date","Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel","Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel","Number of adults","Number of children","Number of babies","Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)","Country of origin. Categories are represented in the ISO 3155–3:2013 format","Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”","Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”","Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of previous bookings that were cancelled by the customer prior to the current booking","Number of previous bookings not cancelled by the customer prior to the current booking","Code of room type reserved. Code is presented instead of designation for anonymity reasons","Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons","Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation","Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.","ID of the travel agency that made the booking", "ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons","Number of days the booking was in the waiting list before it was confirmed to the customer","Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking","Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights","Number of car parking spaces required by the customer","Number of special requests made by the customer (e.g. twin bed or high floor)","Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why","Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel")


Data_Dictionary <- as_data_frame(cbind(Variable_Name, Variable_Description))

Data_Dictionary

## # A tibble: 32 x 2
##    Variable_Name             Variable_Description                               
##    <chr>                     <chr>                                              
##  1 hotel                     Hotel (H1 = Resort Hotel or H2 = City Hotel)       
##  2 is_canceled               Value indicating if the booking was canceled (1) o~
##  3 lead_time                 Number of days that elapsed between the entering d~
##  4 arrival_date_year         Year of arrival date                               
##  5 arrival_date_month        Month of arrival date                              
##  6 arrival_date_week_number  Week number of year for arrival date               
##  7 arrival_date_day_of_month Day of arrival date                                
##  8 stays_in_weekend_nights   Number of weekend nights (Saturday or Sunday) the ~
##  9 stays_in_week_nights      Number of week nights (Monday to Friday) the guest~
## 10 adults                    Number of adults                                   
## # ... with 22 more rows

Data Import and Data Cleaning

Data Import

We have read our csv file and found that there are 119390 rows and 32 columns in the dataset.

Hotels_Data <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')

## Rows: 119390 Columns: 32

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

Missing Values Check:

#To check for missing values in each column

colSums(is.na(Hotels_Data))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

From above, we can see we have 4 missing values for children column and no missing values for rest of the variables.

Removing NA or Missing Value Records:

Hotels_Data <- na.omit(Hotels_Data)
Hotels_Data

## # A tibble: 119,386 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0       342              2015 July              
##  2 Resort Hotel           0       737              2015 July              
##  3 Resort Hotel           0         7              2015 July              
##  4 Resort Hotel           0        13              2015 July              
##  5 Resort Hotel           0        14              2015 July              
##  6 Resort Hotel           0        14              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         9              2015 July              
##  9 Resort Hotel           1        85              2015 July              
## 10 Resort Hotel           1        75              2015 July              
## # ... with 119,376 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...

Since, we have just 4 missing values in children column, we have removed those records as it will not hinder our further analysis. Now, we have 119386 records present in our dataset.

Removing Duplicate Records:

Sometimes, we have same records present in the dataset multiple times. This causes sampling bias and hence, we remove any duplicate records present in the dataset.

Hotels_Data <- unique(Hotels_Data)

The total unique observations has become 87392 which shows that approximately 27% of our records were duplicate which were dropped off.

Checking for Correctness of Data in Each Column

From the data dictionary, we can see column meal has two values SC and Undefined which basically represent the same thing i.e.no meal package. Thus, we replace “Undefined” with “SC”.

# Replacing the two values representing the same thing by a single value

Hotels_Data$meal <- replace(Hotels_Data$meal,Hotels_Data$meal == 'Undefined','SC')

# Structure of Dataset

str(Hotels_Data)

## tibble [87,392 x 32] (S3: tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:87392] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:87392] 0 0 0 0 0 0 0 1 1 1 ...
##  $ lead_time                     : num [1:87392] 342 737 7 13 14 0 9 85 75 23 ...
##  $ arrival_date_year             : num [1:87392] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:87392] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:87392] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:87392] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:87392] 0 0 1 1 2 2 2 3 3 4 ...
##  $ adults                        : num [1:87392] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:87392] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:87392] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:87392] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:87392] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:87392] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:87392] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:87392] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:87392] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:87392] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:87392] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:87392] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:87392] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:87392] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:87392] 0 0 0 0 1 0 1 1 0 0 ...
##  $ reservation_status            : chr [1:87392] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:87392], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
##   ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"

From the structure of our dataset, we can see a lot of categorical data is of character data type and do not provide much information in this state. Summary() function handles factors differently than numbers (and strings), the occurrence counts for each value is often more useful information. Therefore, our next step is to convert categorical and binary variables into factors and find the summary statistics.

# Conversion of Categorical and Binary Variables to Factors

Hotels_Data <- Hotels_Data %>%
  mutate(hotel = as.factor(hotel),      
         is_canceled = as.factor(is_canceled),
         arrival_date_day_of_month = as.factor(arrival_date_day_of_month),
         arrival_date_month = as.factor(arrival_date_month),
         arrival_date_year = as.factor(arrival_date_year),
         meal = as.factor(meal),
         country = as.factor(country),
         market_segment = as.factor(market_segment),
         distribution_channel = as.factor(distribution_channel),
         is_repeated_guest = as.factor(is_repeated_guest),
         reserved_room_type = as.factor(reserved_room_type),
         assigned_room_type = as.factor(assigned_room_type),
         deposit_type = as.factor(deposit_type),
         customer_type = as.factor(customer_type),
         reservation_status = as.factor(reservation_status),
         agent = as.factor(agent),
         company = as.factor(company)
         )

# Finding glimpse of Hotels Dataset after conversion of variables to factors

glimpse(Hotels_Data)

## Rows: 87,392
## Columns: 32
## $ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled                    <fct> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, ~
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 0, 9, 85, 75, 23, ~
## $ arrival_date_year              <fct> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, ~
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <fct> BB, BB, BB, BB, BB, BB, FB, BB, HB, BB,~
## $ country                        <fct> PRT, PRT, GBR, GBR, GBR, PRT, PRT, PRT,~
## $ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <fct> C, C, A, A, A, C, C, A, D, E, D, D, G, ~
## $ assigned_room_type             <fct> C, C, C, A, A, C, C, A, D, E, D, E, G, ~
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent                          <fct> NULL, NULL, NULL, 304, 240, NULL, 303, ~
## $ company                        <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <fct> Transient, Transient, Transient, Transi~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 107.00~
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 3, 1, ~
## $ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~

# Summary Statistics

summary(Hotels_Data)

##           hotel       is_canceled   lead_time      arrival_date_year
##  City Hotel  :53424   0:63371     Min.   :  0.00   2015:13309       
##  Resort Hotel:33968   1:24021     1st Qu.: 11.00   2016:42391       
##                                   Median : 49.00   2017:31692       
##                                   Mean   : 79.89                    
##                                   3rd Qu.:125.00                    
##                                   Max.   :737.00                    
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :11253      Min.   : 1.00            17     : 3020            
##  July   :10057      1st Qu.:16.00            2      : 3016            
##  May    : 8355      Median :27.00            26     : 3000            
##  April  : 7908      Mean   :26.84            5      : 2978            
##  June   : 7765      3rd Qu.:37.00            16     : 2959            
##  March  : 7513      Max.   :53.00            19     : 2949            
##  (Other):34541                               (Other):69470            
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.000          Min.   : 0.000       Min.   : 0.000  
##  1st Qu.: 0.000          1st Qu.: 1.000       1st Qu.: 2.000  
##  Median : 1.000          Median : 2.000       Median : 2.000  
##  Mean   : 1.005          Mean   : 2.625       Mean   : 1.876  
##  3rd Qu.: 2.000          3rd Qu.: 4.000       3rd Qu.: 2.000  
##  Max.   :19.000          Max.   :50.000       Max.   :55.000  
##                                                               
##     children           babies         meal          country     
##  Min.   : 0.0000   Min.   : 0.00000   BB:67974   PRT    :27449  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   FB:  360   GBR    :10433  
##  Median : 0.0000   Median : 0.00000   HB: 9085   FRA    : 8837  
##  Mean   : 0.1386   Mean   : 0.01082   SC: 9973   ESP    : 7252  
##  3rd Qu.: 0.0000   3rd Qu.: 0.00000              DEU    : 5387  
##  Max.   :10.0000   Max.   :10.00000              ITA    : 3066  
##                                                  (Other):24968  
##        market_segment  distribution_channel is_repeated_guest
##  Aviation     :  227   Corporate: 5081      0:83977          
##  Complementary:  702   Direct   :12988      1: 3415          
##  Corporate    : 4212   GDS      :  181                       
##  Direct       :11803   TA/TO    :69141                       
##  Groups       : 4942   Undefined:    1                       
##  Offline TA/TO:13889                                         
##  Online TA    :51617                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.000                 A      :56552     
##  1st Qu.: 0.00000       1st Qu.: 0.000                 D      :17398     
##  Median : 0.00000       Median : 0.000                 E      : 6049     
##  Mean   : 0.03042       Mean   : 0.184                 F      : 2823     
##  3rd Qu.: 0.00000       3rd Qu.: 0.000                 G      : 2052     
##  Max.   :26.00000       Max.   :72.000                 B      :  995     
##                                                        (Other): 1523     
##  assigned_room_type booking_changes       deposit_type       agent      
##  A      :46313      Min.   : 0.0000   No Deposit:86247   9      :28758  
##  D      :22432      1st Qu.: 0.0000   Non Refund: 1038   240    :13028  
##  E      : 7195      Median : 0.0000   Refundable:  107   NULL   :12191  
##  F      : 3627      Mean   : 0.2716                      14     : 3348  
##  G      : 2498      3rd Qu.: 0.0000                      7      : 3300  
##  C      : 2165      Max.   :21.0000                      250    : 2779  
##  (Other): 3162                                           (Other):23988  
##     company      days_in_waiting_list         customer_type        adr         
##  NULL   :82133   Min.   :  0.0000     Contract       : 3139   Min.   :  -6.38  
##  40     :  851   1st Qu.:  0.0000     Group          :  544   1st Qu.:  72.00  
##  223    :  503   Median :  0.0000     Transient      :71986   Median :  98.10  
##  45     :  238   Mean   :  0.7496     Transient-Party:11723   Mean   : 106.34  
##  153    :  206   3rd Qu.:  0.0000                             3rd Qu.: 134.00  
##  154    :  133   Max.   :391.0000                             Max.   :5400.00  
##  (Other): 3328                                                                 
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Canceled :23007   
##  1st Qu.:0.00000             1st Qu.:0.0000            Check-Out:63371   
##  Median :0.00000             Median :0.0000            No-Show  : 1014   
##  Mean   :0.08423             Mean   :0.6985                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-03-18     
##  Median :2016-09-08     
##  Mean   :2016-08-31     
##  3rd Qu.:2017-03-05     
##  Max.   :2017-09-14     
##

The summary statistics above shows much more meaningful information now. The columns which has been converted to factors are now showing count of observations for each level.

It also indicates that the adr column has some negative values as well which is not possible which shows data is incorrect and needs to be further investigated.
Some of the bookings has adults column as 0 which is legally not possible. There cannot be any booking made just for children and babies when they are not accompanied by any adults. This shows that there is some error in this data.

# Checking Different values of Meal column 

levels(Hotels_Data$meal)

## [1] "BB" "FB" "HB" "SC"

From above, we can see that the meals column has been corrected to have only four values.

# Finding negative value in adr column

nrow(subset(Hotels_Data, Hotels_Data$adr < 0))

## [1] 1

We can see that there is only 1 row having negative adr which makes no sense as if a room is booked, it will have some positive or zero (in case of No Deposit) average daily rate. We can drop off these records for our further analysis.

# Removing adr columns with zero or negative values

Hotels_Data <- Hotels_Data %>% filter(Hotels_Data$adr > 0 | Hotels_Data$adr == 0)

So, we have truncated our dataset to 87391 rows by removing observations having negative or zero adr.

# Finding number of observations when no adult checked in
nrow(subset(Hotels_Data, Hotels_Data$adults == 0))

## [1] 385

# Finding number of observations when no adults, children and babies checked in
nrow(subset(Hotels_Data, Hotels_Data$babies == 0 & Hotels_Data$adults == 0 & Hotels_Data$children == 0))

## [1] 166

# Finding number of observations when only babies checked in
nrow(subset(Hotels_Data, Hotels_Data$babies != 0 & Hotels_Data$adults == 0 & Hotels_Data$children == 0))

## [1] 0

In the above result, we have 385 records when no adults checked-in. It might seem impractical at first go but we cannot consider it as incorrect data. For example, a person who is 17 years old can book a hotel but will be categorized under children column.

However, we can see there are 166 observations where no adult, babies or children have checked-in. This doesn’t seem right and can be easily dropped off from the rest of the dataset.

# Removing observations having no adults, children or babies checked-in

Hotels_Data <- Hotels_Data %>% filter(Hotels_Data$adults != 0 | Hotels_Data$adults != 0 | Hotels_Data$children != 0)

Our dataset is now trimmed down to 87225 observations.

Outlier Detection

Another very important aspect of data cleaning is to find out the outliers present in the data. These outliers may be due to incorrect data or due to the extreme values. We are going to find out the outliers of each numeric column to determine whether to keep or remove those outliers.

We have used boxplots to determine the outliers of each numeric column.

# Creating boxplot of all  numeric variables

boxplot(Hotels_Data$lead_time, ylab = "lead_time",main = "Boxplot of lead_time")

boxplot(Hotels_Data$arrival_date_week_number, ylab = "arrival_date_week_number",main = "Boxplot of arrival_date_week_number")

boxplot(Hotels_Data$stays_in_weekend_nights, ylab = "stays_in_weekend_nights",main = "Boxplot of stays_in_weekend_nights")

boxplot(Hotels_Data$stays_in_week_nights, ylab = "stays_in_week_nights",main = "Boxplot of stays_in_week_nights")

boxplot(Hotels_Data$adults, ylab = "adults",main = "Boxplot of adults")

boxplot(Hotels_Data$children, ylab = "children",main = "Boxplot of children")

boxplot(Hotels_Data$babies, ylab = "babies",main = "Boxplot of babies")

boxplot(Hotels_Data$previous_cancellations, ylab = "previous_cancellations",main = "Boxplot of previous_cancellations")

boxplot(Hotels_Data$previous_bookings_not_canceled, ylab = "previous_bookings_not_canceled",main = "Boxplot of previous_bookings_not_canceled")

boxplot(Hotels_Data$booking_changes, ylab = "booking_changes",main = "Boxplot of booking_changes")

boxplot(Hotels_Data$days_in_waiting_list, ylab = "days_in_waiting_list",main = "Boxplot of days_in_waiting_list")

boxplot(Hotels_Data$adr, ylab = "adr",main = "Boxplot of Average_Daily_Rate")

boxplot(Hotels_Data$required_car_parking_spaces, ylab = "required_car_parking_spaces",main = "Boxplot of required_car_parking_spaces")

boxplot(Hotels_Data$total_of_special_requests, ylab = "total_of_special_requests",main = "Boxplot of total_of_special_requests")

From boxplots above, we can visualize that there are outliers present for all numeric columns except “arrival_date_week_number” column. At this point, these outliers seems to be something originating from extreme values and not due to error. Thus, we will keep them in our dataset.

Final Dataset

After necessary cleaning, we have our final dataset ready to use for Exploratory Data Analysis. Let’s take a look at our final dataset.

DT::datatable(head(Hotels_Data,100))

Exploratory Data Analysis

We have two problem statements that we are looking to resolve. Both the problem statements are actually intertwined:

To explore how much booking cancellations are done for both the hotels,what are the factors that are contributing to booking cancellations, what kind of customers are doing cancellations etc.
To determine the revenue generated for both the hotels, What kind of customers are contributing more towards revenue generation, how booking cancellations is impacting revenue generation etc.

Let’s dive into it:

#To find number of booking cancellations for each hotel type

Hotels_Data %>% 
  group_by(hotel,is_canceled) %>% 
  summarise(length(is_canceled))

## `summarise()` has grouped output by 'hotel'. You can override using the `.groups` argument.

## # A tibble: 4 x 3
## # Groups:   hotel [2]
##   hotel        is_canceled `length(is_canceled)`
##   <fct>        <fct>                       <int>
## 1 City Hotel   0                           37239
## 2 City Hotel   1                           16031
## 3 Resort Hotel 0                           25981
## 4 Resort Hotel 1                            7974

From above result, we can see that booking cancellations of City Hotel is almost 30% and that of Resort Hotel is 23%. Thus, City Hotel has slightly more cancellations than Resort Hotel but this is not significant difference in terms of cancelled bookings and both the hotels should try to minimize their cancellations.

We can visualize the above result from a bar chart as well.

BookingStatus_bar <- function(behavior){
   Hotels_Data %>%
    ggplot(aes(is_canceled, fill = behavior)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior of Booking Status", 
         subtitle = behavior,
         x = "Booking Status (1 for Cancelled)", 
         y = "Percentage of Cancellation") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "blue"))
}

BookingStatus_bar(Hotels_Data$hotel)

# Total Number of cancellations for both existing customers and new customers

Hotels_Data %>% 
  group_by(is_repeated_guest,is_canceled) %>% 
  summarise(length(is_canceled))

## `summarise()` has grouped output by 'is_repeated_guest'. You can override using the `.groups` argument.

## # A tibble: 4 x 3
## # Groups:   is_repeated_guest [2]
##   is_repeated_guest is_canceled `length(is_canceled)`
##   <fct>             <fct>                       <int>
## 1 0                 0                           60117
## 2 0                 1                           23745
## 3 1                 0                            3103
## 4 1                 1                             260

From above, it is clear that number of bookings cancelled for new customers is 23745 (approximately 28%) while that of existing customers is 260 (approximately 7%). Thus, it is evident that the existing customers tend to trust the hotels but new customer struggles with that.

We can visualize the above result using a barchart below.Here 0 represents a New Customer and 1 represents a existing customer.

BookingStatus_bar(Hotels_Data$is_repeated_guest)

# Effect of Lead Time on Booking Cancellation
Hotels_Data %>% 
  group_by(is_canceled) %>% 
  summarise(mean = mean(lead_time))

## # A tibble: 2 x 2
##   is_canceled  mean
##   <fct>       <dbl>
## 1 0            70.2
## 2 1           106.

From above result, we can see that when average lead time is 105, customers tend to cancel their bookings.However, bookings continue to prevail if average lead time is 70. This indicates that hotels should try to minimize their lead time and try to keep it within less than 3 months.

# Booking Status based on Deposit Type

BookingStatus_bar(Hotels_Data$deposit_type)

From above bar graph, we can see that cancellations are done even if it is non-refundable. Thus, Deposit_Type is not our variable of concern.Infact, more cancellations are done on Non-Refundable type contrary to our popular belief.

table(Hotels_Data$deposit_type, Hotels_Data$is_canceled)

##             
##                  0     1
##   No Deposit 63084 22996
##   Non Refund    55   983
##   Refundable    81    26

The statistics of booking status based on Deposit type is displayed above.

# Impact of Wait Period on Booking Cancellation

Hotels_Data %>% 
  group_by(is_canceled) %>% 
  summarise(mean(days_in_waiting_list))

## # A tibble: 2 x 2
##   is_canceled `mean(days_in_waiting_list)`
##   <fct>                              <dbl>
## 1 0                                  0.717
## 2 1                                  0.823

From above, we can see more cancellations are done when average wait period is more than 82 days.

# Impact of Market Segment on Cancelled Bookings

BookingStatus_bar(Hotels_Data$market_segment)

From above graph, it is quite evident, Online TA market segment is contributing towards more booking cancellations.

# Impact of Distribution Channel on Cancelled Bookings

BookingStatus_bar(Hotels_Data$distribution_channel)

From above graph, we can see more cancellations are done through TA/TO distribution channel.

#Impact of previous Cancellation on Booking Cancellation

Hotels_Data %>% 
  group_by(is_canceled) %>% 
  summarise(mean = mean(previous_cancellations))

## # A tibble: 2 x 2
##   is_canceled   mean
##   <fct>        <dbl>
## 1 0           0.0187
## 2 1           0.0613

As indicated above, those who have done cancellations previously tend to cancel more.

#Impact of previous boking changes on Booking Cancellation

Hotels_Data %>% 
  group_by(is_canceled) %>% 
  summarise(mean = mean(booking_changes))

## # A tibble: 2 x 2
##   is_canceled  mean
##   <fct>       <dbl>
## 1 0           0.309
## 2 1           0.161

Above tibble shows that those who do more booking changes tend to book hotels actually. But it does not give us much information. Thus, it is not a variable of our concern.

# Behavior of Customer Type on Booking Cancellation

BookingStatus_bar(Hotels_Data$customer_type)

Transient customer type tends to do more boking cancellations.

Hotels_Data %>% 
  group_by(is_canceled) %>% 
  summarise(mean(total_of_special_requests))

## # A tibble: 2 x 2
##   is_canceled `mean(total_of_special_requests)`
##   <fct>                                   <dbl>
## 1 0                                       0.761
## 2 1                                       0.536

Above result does not give much information about the impact of special requests on booking cancellations. Thus, we can remove it from our further analysis.

Now, to deal with our second problem statement to find the revenue generation of each hotel, we have to first create a variable to store revenue earned from each booking.

# To add a column showing revenue per booking

Hotels_Data <- Hotels_Data %>% 
  mutate(Revenue_Per_Booking = adr * (stays_in_weekend_nights + stays_in_week_nights))

dim(Hotels_Data)

## [1] 87225    33

As shown above, our new dataset has one new column added to it i.e. Revenue_Per_Booking. Now, we have 33 variables.

# Summary Statistics of Revenue_Per_Booking

summary(Hotels_Data$Revenue_Per_Booking)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   153.0   299.2   395.0   504.0  7590.0

The above stat indicates that the average revenue generated per booking is 395$.Minimum value is 0 which might be the case where bookings are done for “No Deposit” type. Hotels should come up with some strategy to avoid this as it will result in revenue loss if customers book it with no deposit type but don’t show up.

Summary Insights

Based on our exploratory data analysis, below are some of the insights we gained:

There are more cancellations in City Hotel (30%) than in Resort Hotel (23%) but this is not a significant difference and both the hotels need to convert these cancellations into successful bookings.
Lead time over 3 months lead to cancellation of bookings.
New Customers tend to do more cancellations than the existing customers. Hotels should come up with some discounts or professional offers to acquire new customers and provide good hospitality to gain their trust.
Having No Deposit Type doesn’t necessarily result in successful booking. Infact, Non-refundable type seems to have more cancellation than successful booking which is quite unusual.
If wait period is more than 82 days, cancellations are more frequent.
The customers who have done cancellations before tend to do more cancellation.
The customers who have done more booking changes tend to do successful booking.
There is more cancellation for market segment Online TA and distribution channel TA/TO.
The average revenue generated per booking is $395

What’s Next?

Our next step is to create a correlation matrix to find correlation among different variables to drop highly correlated variables from our analysis. Our problem is a classification problem. We will use linear regression to find dependency of booking cancellation on other variables to determine the best fitted line. We will also try to use some modeling technique to predict the booking cancellation.