1 INTRODUCTION

1.1 Problem

Travel is an enlightening and eye-opening experience for both adults and children of all ages. However, traveling can also be overwhelming due to unpredictable schedules, jet lag, where to stay et, cl. Taking staying for example, how would people that book the same hotels repeatedly behave? How likely Would they cancel a booking? What kind of meal plan would they choose? How many days ahead of checking in would they book a hotel? In this project, I am going to use a real data to answer these questions and more.

1.2 Data

The original data for this project is from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019. The original data was then cleaned promarily by some data scientist. The source data can be found at tidytuesday. This dataset includes 32 variables, recording the hotel and stay information such as hotel type, arrivel time, number of people, what meal plan they choose and so on.

1.3 Strategy

Explore the behavior the repeated guests by mining the relationship between variable “is_repeated_guest” and several other variables in this dataset. Display the visualization as well as statistics of the results. Dig the insights from the results

1.4 Insights

The project aims to gain interesting insight into customers’ behavior when booking a hotel. These insights can guide hotels to adjust their customer strategies and make preparation for unknown.

2 PACKAGE REQUIRED

The following packages are needed to import and manipulate the data.

library(data.table)   #import data
library(tidyverse)    #tidy data, visualisation, transformation
library(tibble)       #create tibbles
library(plotrix)      #3D Exploded Pie Chart
library(tidyr)        #tidy data
library(dplyr)        #data analysis
library(DT)           #display data set
library(ggplot2)      #data visualization
library(magrittr)     #pipe oprator
library(knitr)        #display tables

3 DATA PREPARATION

3.1 Read data into R

First, I download the data set from website tidytuesday, and then read it into R.

hotels <- fread("E:/Courses/20Spring/Data wrangling/Final/hotels/hotels.csv")

3.2 Look into the data

3.2.1 Data structure

This data set contains 119390 observations and 32 variables. Some of the variables are interger, others are charactor. It includes information about hotel names, lead time, number of children, agent, and much more.

str(hotels)
## Classes 'data.table' and 'data.frame':   119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.2.2 Missing value

By checking missing values for each variables, we can see that there are only 4 missing value for variable “children”, no missing value in other variables.

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

3.3 Data cleaning

3.3.1 Take care of the missing value

There are four missing values in the variable “children”. They are imputed by the median value of this variable. Considering the big sample size, this approach would be appropriate.

hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)

3.3.2 Make data makes more sense

The “Undefined” meal are imputed by “SC” because both “Undefined” and “SC” mean the customer choose not to eat at hotel.

hotels$meal <- as.factor(hotels$meal)
levels(hotels$meal) <- c("BB", "FB", "HB", "SC", "SC")

The variable “is_repeated_guest” is a binary, so it should be a factor. Same change needed for the variable “is_canceled”.

hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)

3.4 Statistical summary

From the statistic summary, we can find some interesting information. For instance, about 37% of bookings have been canceled. The booking information is from year 2015 to 2017. Half of the bookings for weekdays have customers stay for two nights. 75% of the bookings have customers stay for three nignts. The longest stay is 50 nights. The average previous cancellation is less than one. 75% of the bookings have one special request. 3% of the bookings are made by repeated guests.

summary(hotels)
##     hotel           is_canceled   lead_time   arrival_date_year
##  Length:119390      0:75166     Min.   :  0   Min.   :2015     
##  Class :character   1:44224     1st Qu.: 18   1st Qu.:2016     
##  Mode  :character               Median : 69   Median :2016     
##                                 Mean   :104   Mean   :2016     
##                                 3rd Qu.:160   3rd Qu.:2017     
##                                 Max.   :737   Max.   :2017     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##     children           babies          meal         country         
##  Min.   : 0.0000   Min.   : 0.000000   BB:92310   Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB:  798   Class :character  
##  Median : 0.0000   Median : 0.000000   HB:14463   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949   SC:11819                     
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                
##  Max.   :10.0000   Max.   :10.000000                                
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        0:115580         
##  Class :character   Class :character     1:  3810         
##  Mode  :character   Mode  :character                      
##                                                           
##                                                           
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##  assigned_room_type booking_changes   deposit_type      
##  Length:119390      Min.   : 0.0000   Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character  
##                     Mean   : 0.2211                     
##                     3rd Qu.: 0.0000                     
##                     Max.   :21.0000                     
##     agent             company          days_in_waiting_list
##  Length:119390      Length:119390      Min.   :  0.000     
##  Class :character   Class :character   1st Qu.:  0.000     
##  Mode  :character   Mode  :character   Median :  0.000     
##                                        Mean   :  2.321     
##                                        3rd Qu.:  0.000     
##                                        Max.   :391.000     
##  customer_type           adr          required_car_parking_spaces
##  Length:119390      Min.   :  -6.38   Min.   :0.00000            
##  Class :character   1st Qu.:  69.29   1st Qu.:0.00000            
##  Mode  :character   Median :  94.58   Median :0.00000            
##                     Mean   : 101.83   Mean   :0.06252            
##                     3rd Qu.: 126.00   3rd Qu.:0.00000            
##                     Max.   :5400.00   Max.   :8.00000            
##  total_of_special_requests reservation_status reservation_status_date
##  Min.   :0.0000            Length:119390      Length:119390          
##  1st Qu.:0.0000            Class :character   Class :character       
##  Median :0.0000            Mode  :character   Mode  :character       
##  Mean   :0.5714                                                      
##  3rd Qu.:1.0000                                                      
##  Max.   :5.0000

3.5 Data description

The data description is shown below to make a reference whenever needed.

hotels.type <- lapply(hotels, class)
hotels.var_desc <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)",
                    "Value indicating if the booking was canceled (1) or not (0)",
                    "Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
                    "Year of arrival date",
                    "Month of arrival date",
                    "Week number of year for arrival date",
                    "Day of arrival date",
                    "Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
                    "Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
                    "Number of adults",
                    "Number of children",
                    "Number of babies",
                    "Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)",
                    "Country of origin. Categories are represented in the ISO 3155–3:2013 format",
                    "Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
                    "Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
                    "Value indicating if the booking name was from a repeated guest (1) or not (0)",
                    "Number of previous bookings that were cancelled by the customer prior to the current booking",
                    "Number of previous bookings not cancelled by the customer prior to the current booking",
                    "Code of room type reserved. Code is presented instead of designation for anonymity reasons",
                    "Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
                    "Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
                    "Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
                    "ID of the travel agency that made the booking",
                    "ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
                    "Number of days the booking was in the waiting list before it was confirmed to the customer",
                    "Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
                    "Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights",
                    "Number of car parking spaces required by the customer",
                    "Number of special requests made by the customer (e.g. twin bed or high floor)",
                    "Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
                    "Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel"
)
hotels.var_names <- colnames(hotels)
data.description <- as_data_frame(cbind(hotels.var_names, hotels.type, hotels.var_desc))
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description)
Variable name Data Type Variable Description
hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled factor Value indicating if the booking was canceled (1) or not (0)
lead_time integer Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year integer Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number integer Week number of year for arrival date
arrival_date_day_of_month integer Day of arrival date
stays_in_weekend_nights integer Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights integer Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults integer Number of adults
children numeric Number of children
babies integer Number of babies
meal factor Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)
country character Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest factor Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations integer Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled integer Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes integer Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type character Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
agent character ID of the travel agency that made the booking
company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list integer Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type character Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr numeric Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces integer Number of car parking spaces required by the customer
total_of_special_requests integer Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status character Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date character Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

3.6 Glipse the data

The dataset look clean and tidy now.

hotels %>% 
  head(n = 20)
##            hotel is_canceled lead_time arrival_date_year
##  1: Resort Hotel           0       342              2015
##  2: Resort Hotel           0       737              2015
##  3: Resort Hotel           0         7              2015
##  4: Resort Hotel           0        13              2015
##  5: Resort Hotel           0        14              2015
##  6: Resort Hotel           0        14              2015
##  7: Resort Hotel           0         0              2015
##  8: Resort Hotel           0         9              2015
##  9: Resort Hotel           1        85              2015
## 10: Resort Hotel           1        75              2015
## 11: Resort Hotel           1        23              2015
## 12: Resort Hotel           0        35              2015
## 13: Resort Hotel           0        68              2015
## 14: Resort Hotel           0        18              2015
## 15: Resort Hotel           0        37              2015
## 16: Resort Hotel           0        68              2015
## 17: Resort Hotel           0        37              2015
## 18: Resort Hotel           0        12              2015
## 19: Resort Hotel           0         0              2015
## 20: Resort Hotel           0         7              2015
##     arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  1:               July                       27                         1
##  2:               July                       27                         1
##  3:               July                       27                         1
##  4:               July                       27                         1
##  5:               July                       27                         1
##  6:               July                       27                         1
##  7:               July                       27                         1
##  8:               July                       27                         1
##  9:               July                       27                         1
## 10:               July                       27                         1
## 11:               July                       27                         1
## 12:               July                       27                         1
## 13:               July                       27                         1
## 14:               July                       27                         1
## 15:               July                       27                         1
## 16:               July                       27                         1
## 17:               July                       27                         1
## 18:               July                       27                         1
## 19:               July                       27                         1
## 20:               July                       27                         1
##     stays_in_weekend_nights stays_in_week_nights adults children babies
##  1:                       0                    0      2        0      0
##  2:                       0                    0      2        0      0
##  3:                       0                    1      1        0      0
##  4:                       0                    1      1        0      0
##  5:                       0                    2      2        0      0
##  6:                       0                    2      2        0      0
##  7:                       0                    2      2        0      0
##  8:                       0                    2      2        0      0
##  9:                       0                    3      2        0      0
## 10:                       0                    3      2        0      0
## 11:                       0                    4      2        0      0
## 12:                       0                    4      2        0      0
## 13:                       0                    4      2        0      0
## 14:                       0                    4      2        1      0
## 15:                       0                    4      2        0      0
## 16:                       0                    4      2        0      0
## 17:                       0                    4      2        0      0
## 18:                       0                    1      2        0      0
## 19:                       0                    1      2        0      0
## 20:                       0                    4      2        0      0
##     meal country market_segment distribution_channel is_repeated_guest
##  1:   BB     PRT         Direct               Direct                 0
##  2:   BB     PRT         Direct               Direct                 0
##  3:   BB     GBR         Direct               Direct                 0
##  4:   BB     GBR      Corporate            Corporate                 0
##  5:   BB     GBR      Online TA                TA/TO                 0
##  6:   BB     GBR      Online TA                TA/TO                 0
##  7:   BB     PRT         Direct               Direct                 0
##  8:   FB     PRT         Direct               Direct                 0
##  9:   BB     PRT      Online TA                TA/TO                 0
## 10:   HB     PRT  Offline TA/TO                TA/TO                 0
## 11:   BB     PRT      Online TA                TA/TO                 0
## 12:   HB     PRT      Online TA                TA/TO                 0
## 13:   BB     USA      Online TA                TA/TO                 0
## 14:   HB     ESP      Online TA                TA/TO                 0
## 15:   BB     PRT      Online TA                TA/TO                 0
## 16:   BB     IRL      Online TA                TA/TO                 0
## 17:   BB     PRT  Offline TA/TO                TA/TO                 0
## 18:   BB     IRL      Online TA                TA/TO                 0
## 19:   BB     FRA      Corporate            Corporate                 0
## 20:   BB     GBR         Direct               Direct                 0
##     previous_cancellations previous_bookings_not_canceled
##  1:                      0                              0
##  2:                      0                              0
##  3:                      0                              0
##  4:                      0                              0
##  5:                      0                              0
##  6:                      0                              0
##  7:                      0                              0
##  8:                      0                              0
##  9:                      0                              0
## 10:                      0                              0
## 11:                      0                              0
## 12:                      0                              0
## 13:                      0                              0
## 14:                      0                              0
## 15:                      0                              0
## 16:                      0                              0
## 17:                      0                              0
## 18:                      0                              0
## 19:                      0                              0
## 20:                      0                              0
##     reserved_room_type assigned_room_type booking_changes deposit_type
##  1:                  C                  C               3   No Deposit
##  2:                  C                  C               4   No Deposit
##  3:                  A                  C               0   No Deposit
##  4:                  A                  A               0   No Deposit
##  5:                  A                  A               0   No Deposit
##  6:                  A                  A               0   No Deposit
##  7:                  C                  C               0   No Deposit
##  8:                  C                  C               0   No Deposit
##  9:                  A                  A               0   No Deposit
## 10:                  D                  D               0   No Deposit
## 11:                  E                  E               0   No Deposit
## 12:                  D                  D               0   No Deposit
## 13:                  D                  E               0   No Deposit
## 14:                  G                  G               1   No Deposit
## 15:                  E                  E               0   No Deposit
## 16:                  D                  E               0   No Deposit
## 17:                  E                  E               0   No Deposit
## 18:                  A                  E               0   No Deposit
## 19:                  A                  G               0   No Deposit
## 20:                  G                  G               0   No Deposit
##     agent company days_in_waiting_list customer_type    adr
##  1:  NULL    NULL                    0     Transient   0.00
##  2:  NULL    NULL                    0     Transient   0.00
##  3:  NULL    NULL                    0     Transient  75.00
##  4:   304    NULL                    0     Transient  75.00
##  5:   240    NULL                    0     Transient  98.00
##  6:   240    NULL                    0     Transient  98.00
##  7:  NULL    NULL                    0     Transient 107.00
##  8:   303    NULL                    0     Transient 103.00
##  9:   240    NULL                    0     Transient  82.00
## 10:    15    NULL                    0     Transient 105.50
## 11:   240    NULL                    0     Transient 123.00
## 12:   240    NULL                    0     Transient 145.00
## 13:   240    NULL                    0     Transient  97.00
## 14:   241    NULL                    0     Transient 154.77
## 15:   241    NULL                    0     Transient  94.71
## 16:   240    NULL                    0     Transient  97.00
## 17:     8    NULL                    0      Contract  97.50
## 18:   240    NULL                    0     Transient  88.20
## 19:  NULL     110                    0     Transient 107.42
## 20:   250    NULL                    0     Transient 153.00
##     required_car_parking_spaces total_of_special_requests
##  1:                           0                         0
##  2:                           0                         0
##  3:                           0                         0
##  4:                           0                         0
##  5:                           0                         1
##  6:                           0                         1
##  7:                           0                         0
##  8:                           0                         1
##  9:                           0                         1
## 10:                           0                         0
## 11:                           0                         0
## 12:                           0                         0
## 13:                           0                         3
## 14:                           0                         1
## 15:                           0                         0
## 16:                           0                         3
## 17:                           0                         0
## 18:                           0                         0
## 19:                           0                         0
## 20:                           0                         1
##     reservation_status reservation_status_date
##  1:          Check-Out              2015-07-01
##  2:          Check-Out              2015-07-01
##  3:          Check-Out              2015-07-02
##  4:          Check-Out              2015-07-02
##  5:          Check-Out              2015-07-03
##  6:          Check-Out              2015-07-03
##  7:          Check-Out              2015-07-03
##  8:          Check-Out              2015-07-03
##  9:           Canceled              2015-05-06
## 10:           Canceled              2015-04-22
## 11:           Canceled              2015-06-23
## 12:          Check-Out              2015-07-05
## 13:          Check-Out              2015-07-05
## 14:          Check-Out              2015-07-05
## 15:          Check-Out              2015-07-05
## 16:          Check-Out              2015-07-05
## 17:          Check-Out              2015-07-05
## 18:          Check-Out              2015-07-02
## 19:          Check-Out              2015-07-02
## 20:          Check-Out              2015-07-05

4 EXPLORATION DATA ANALYSIS

4.1 Distribution of guests

From the following table and piechart we can see that 3.19% of the guests are repeated guests, the rest are unrepeated guests. Since people tend not to visit a certain place over and over again, so these numbers make sense.

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count(is_repeated_guest) %>% 
  summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 x 3
##   is_repeated_guest percent  total
##   <fct>               <dbl>  <int>
## 1 0                   96.8  115580
## 2 1                    3.19   3810
slices <- c(sum(hotels$is_repeated_guest=="1"),sum(hotels$is_repeated_guest=="0"))
lbls <- c("Repeated guest","Unrepeated guest")
pie3D(slices,labels=lbls,explode=0.1)  

4.2 Lead time difference between two groups of guests

In order to explore several bahavior of repeated guests, I created the following two functions to disploay the visualization.

guestbehabar <- function(behavior){
   hotels %>%
    ggplot(aes(is_repeated_guest, fill = behavior)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior feature by guest type", 
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "Percentage of group") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}
guestbehaboxplot <- function(behavior){
   hotels %>%
    ggplot(aes(x = is_repeated_guest, y = behavior)) +
    geom_boxplot() +
    geom_jitter(width = .15, alpha = .2) +
    labs(title = "Behavior feature by guest type",
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "count") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}

The lead time for the repeated guests are shorter than that of unrepeated guests significantly. The average lead time for the repeated guests is 31 days, while that for unrepeated guests is 106 days.

guestbehaboxplot(hotels$lead_time)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  summarise(mean = mean(lead_time))
## # A tibble: 2 x 2
##   is_repeated_guest  mean
##   <fct>             <dbl>
## 1 0                 106. 
## 2 1                  30.8

4.3 How offen do repeated guests cancel booking?

43672 unrepeated guests cancel booking, which counts for 37.8% of the total unrepeated guests. On the other hand, 552 repeated guests cancel booking, which only accounts for 14.4% of the repeated group.

guestbehabar (hotels$is_canceled)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(is_canceled == "1") %>% 
  count() -> filcan

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count() -> total

as.data.frame(filcan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.3778508
## 2                NA 0.1448819

4.4 Do repeated guests cancel bookings previously?

Yes. 75.6% repeated guests don’t have cancelling records. The percentage for unrepeated guests is 95.2%. This makes sense because unrepeated guests have a unrepeating fashion. The mean cancellation for repeated guests and unrepeated guests are 0.47 and 0.07, respectively.

guestbehaboxplot(hotels$previous_cancellations)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(previous_cancellations == 0) %>% 
  count() -> filpre

as.data.frame(filpre/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.9519207
## 2                NA 0.7566929
hotels %>% 
  group_by(is_repeated_guest) %>% 
  summarise(mean = mean(previous_cancellations))
## # A tibble: 2 x 2
##   is_repeated_guest   mean
##   <fct>              <dbl>
## 1 0                 0.0745
## 2 1                 0.470

4.5 How many bookings haven’t been cancelled previously?

74.5% bookings made by repeated guests haven’t been cancelled. The percentage for unrepeated guests is only 0.7%. The average previous bookings that haven’t been canceled for repeated guests and unrepeated guests are 3.59 and 0.02, respectively.

guestbehaboxplot(hotels$previous_bookings_not_canceled)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(previous_bookings_not_canceled > 0) %>% 
  count() -> filno       

as.data.frame(filno/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest           n
## 1                NA 0.006765876
## 2                NA 0.744881890
hotels %>% 
  group_by(is_repeated_guest) %>% 
  summarise(mean = mean(previous_bookings_not_canceled))
## # A tibble: 2 x 2
##   is_repeated_guest   mean
##   <fct>              <dbl>
## 1 0                 0.0234
## 2 1                 3.59

4.6 What type of hotel would repeated guests book?

It turns out that repeated guests prefer City Hotel other than Resort Hotel. However, the percentage of repeated guests that book Resort Hotel is 46.7%, while only 33.1% of unrepeated guests book Resort Hotel.

guestbehabar(hotels$hotel)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(hotel == "Resort Hotel") %>% 
  count() -> filhot       

as.data.frame(filhot/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.3312165
## 2                NA 0.4666667

4.7 Do repeated guests make change of booking?

81.1% of repeated guests don’t make change of booking. This number is slightly lower that that of unrepeated guests.

guestbehaboxplot(hotels$booking_changes)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(booking_changes == 0) %>% 
  count() -> filchan      

as.data.frame(filchan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.8498183
## 2                NA 0.8115486

4.8 Meal choice of repeated guests.

By comparing the meal choice of unrepeated guests, repeated guests have high percentage that choose Bed & Breakfast, lower percentage for all the other three type of meal plan.

guestbehabar(hotels$meal)

table(hotels$meal, hotels$is_repeated_guest)
##     
##          0     1
##   BB 88837  3473
##   FB   789     9
##   HB 14277   186
##   SC 11677   142
hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(meal == "BB") %>% 
  count() -> filmea     

as.data.frame(filmea/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.7686191
## 2                NA 0.9115486

4.9 Do repeated guests make deposit?

Normally no. 98.2% of repeated guests don’t make deposit. A possiable reason is that they are reliable guests, so that they don’t need to make deposit. The percentage for unrepeated guests is 87.3%.

guestbehabar(hotels$deposit_type)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(deposit_type == "No Deposit") %>% 
  count() -> fildep       

as.data.frame(fildep/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.8729798
## 2                NA 0.9821522

4.10 Do repeated guests make special requests?

More than half of them don’t. 58.2% of repeated guests don’t make special requests. This percentage is very close to that of unrepeated guests.

guestbehabar(factor(hotels$total_of_special_requests))

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(total_of_special_requests == 0) %>% 
  count() -> filspe   

as.data.frame(filspe/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.5892109
## 2                NA 0.5818898

4.11 Customer type of repeated guests

80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking. 4.2% of the repeated guests make bookings that are associated to a group. Both of these two numbers are higher than that of unrepeated guests making the bookings.

guestbehabar(hotels$customer_type)

custype <- function(type){
  hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(customer_type == type) %>% 
  count()
}

custype("Transient") -> filtra
custype("Group") -> filgro

as.data.frame(filtra/total) 
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.7487455
## 2                NA 0.8065617
as.data.frame(filgro/total)                 
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest           n
## 1                NA 0.003599239
## 2                NA 0.042257218

5 SUMMARY AND INSIGHT