Introduction

The purpose of the project is to analyze the factors that affect such as hotel booking and cancellation, as well as guest retention, that will ultimately be used to boost hotel revenue.
For these purpose, hotel booking dataset was used in this project. This data includes two types of hotels: resort hotel (H1) and city hotel (H2).
I will also try to obtain more information about the guests checking such as when they booked their reservation, and any insights that could be helpful in understanding the client base. In addition, I will also evaluate subgroup of these reservations such as guests who chose to cancel their stays to see if there are any underlying factors that may contribute to guest cancellations.
With these insights, hotels could make better data-driven decisions that could finally increase their revenue.

Packages Required

library(ggplot2) # For graphs in Explorartory Data Analysis
## Warning: package 'ggplot2' was built under R version 3.6.3
library(dplyr)   # For Data Manipulation like select(),mutate(),groupby()
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(scales)  #Automatically determine breaks and labels for axes and legends. 
## Warning: package 'scales' was built under R version 3.6.3
library(tidyverse) #Meta Package to load other necessary Packages like ggplot2,dplyr,readr
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.1     v purrr   0.3.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(readr)     #For reading contents from a file
library(e1071)      #For naive bayes,SVM models
## Warning: package 'e1071' was built under R version 3.6.3
library(caret)      #Misc. functions for training and plotting classification and regression models 
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)      #For building decision tree models
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot) #For Plotting decision Trees
## Warning: package 'rpart.plot' was built under R version 3.6.3

Data Preparation

The data was cited from the original article Hotel Booking Demand Datasets, authored by Nuno Antonio, Ana Almeida, and Luis Nunes published in the journal Data in Brief, Volume 22, February 2019.
This data set contains booking information for resort hotel (H1) and city hotel (H2), and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. Both data sets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.

Loading data

hotel <- read_csv("C:/BANA/Data_analytics_program/Summer_2021/BANA_7025_DATA_WRANGLING/Middle_term_YGY/datasets/hotels/hotels.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   hotel = col_character(),
##   arrival_date_month = col_character(),
##   meal = col_character(),
##   country = col_character(),
##   market_segment = col_character(),
##   distribution_channel = col_character(),
##   reserved_room_type = col_character(),
##   assigned_room_type = col_character(),
##   deposit_type = col_character(),
##   agent = col_character(),
##   company = col_character(),
##   customer_type = col_character(),
##   reservation_status = col_character(),
##   reservation_status_date = col_date(format = "")
## )
## i Use `spec()` for the full column specifications.

Check the size and summarize the dataset

dim(hotel)
## [1] 119390     32
summary(hotel)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

To display the column names of the dataset more clearly

colnames(hotel)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"

Check the number of null values in each column

colSums(is.na(hotel))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Remove NA and other undefined value

hotel$meal[hotel$meal=='Undefined'] <- 'SC'
hotel$children[is.na(hotel$children)] <- 0
hotel <- subset(hotel, market_segment!='Undefined')
hotel <- subset(hotel, distribution_channel!='Undefined')  

Exploratory Data Analysis

Check the number of booking at respective hotels

table(hotel$hotel)
## 
##   City Hotel Resort Hotel 
##        79326        40059

The above table clearly showed that more City Hotel booking was included than that of Resort Hotels in this dataset.

Check the distribution of hotel type for cancellation

table(hotel$is_canceled, hotel$hotel)
##    
##     City Hotel Resort Hotel
##   0      46228        28937
##   1      33098        11122

It was really surprisingly that for both of the hotels, the percentage of cancellation was pretty high, especially for the City Hotel.

Length stayed

Next, I tried to check how long the guests stay. The dataset includes weekends and weekdays, so I combined the two columns into a vector and examined the total time.

weekend <- hotel$stays_in_weekend_nights
weekday <- hotel$stays_in_week_nights
total <- weekday + weekend
total[total == 0] <- NA
summary(total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   3.000   3.449   4.000  69.000     715
total1 <-as.factor(total)
plot(total1)

From the figure above we can conclude that most of guests stayed for 1 - 4 days.

Check the number of guests for each booking

adult <- hotel$adults
child <- hotel$children
baby <- hotel$babies
total_guest <- adult + baby + child
total_guest[total_guest == 0] <- NA
summary(total_guest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   2.000   1.971   2.000  55.000     180

From the above output, it was hard to say what’s the number of the most guest per booking, so we plot the number of guest per booking.

guest <-as.factor(total_guest)
plot(guest,ylab="Number of Booking",xlab="Number of Guests")

From the results above, it was clear shown that most of the booking have two guests.

Stays by Month

I continued to check the distribution of hotel bookings on a monthly basis and plotted the bookings as following:

month <-as.factor(hotel$arrival_date_month)
month<-factor(month, levels=c("January","February","March","April","May","June","July","August","September","October","November","December"))
plot(month)

From the figure above, we can generally see that guests prefer to book the hotel during warm/hot seasons.

To check if there is any specific day of a month the guests prefer to arrive.

day <- as.factor(hotel$arrival_date_day_of_month)
plot(day, ylab = "Nunber of Guest Arrival", xlab = "Day of the Month")

From the above figure we can see that there is no clear trend that guests prefer to arrive at specific day(s) of a month. The reason that the 31st day of the month has significantly less arrivals than other days is because not every month has the 31th day.

More coming soon

Above are just some raw data pre cleaning and preliminary analysis of the dataset. Toward the final project, I am working on establishing models to analyze various factors(variables) that affect guest booking cancellation as well as guest retention.