Getting start

The data is from Kaggle”https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand

hotel_bookings <- read.csv("~/Desktop/Kaggle/Hotel Revenue/hotel_bookings.csv")

Introduction to the background and some detail of the dataset

Content

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

All personally identifying information has been removed from the data.

Acknowledgements

The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

I regroup the detail of the dataset in the following sheet.

variable description
hotel Hotel (H1 = Resort Hotel or H2 = City Hotel).
is_canceled Value indicating if the booking was canceled (1) or not (0).
lead_time Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year Year of arrival date
arrival_date_month Month of arrival date
arrival_date_week_number Week number of year for arrival date
arrival_date_day_of_month. Day of arrival date
stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults Number of adults
children Number of children
babies Number of babies
meal Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
country Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type. Code of room type reserved. Code is presented instead of designation for anonymity reasons.
assigned_room_type Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
booking_changes Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
agent ID of the travel agency that made the booking
company ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces Number of car parking spaces required by the customer
total_of_special_requests Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Analysis report scene create

In order to present a better report, in the following analysis I will assume myself as a revenue analyst of “Resort Hotel”. :)

Load the packages.

library(ggplot2)
library(dplyr)
library(skimr)
library(statsr)
library(tidyverse)
library(caret)
library(lattice)
library(magrittr)
library(rvest)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf) 
library(cowplot)
library(googleway)
library(maps)
library(Hmisc)
library(scales)
library(gridExtra)

options(dplyr.summarise.inform = FALSE)

Quick look of the data structure

names(hotel_bookings)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
str(hotel_bookings)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
skim(hotel_bookings)
Data summary
Name hotel_bookings
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 14
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 10 12 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0
reservation_status_date 0 1 10 10 0 926 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_year 0 1 2016.16 0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
arrival_date_day_of_month 0 1 15.80 8.78 1.00 8.00 16.00 23 31 ▇▇▇▇▆
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 4 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁

Finding: there are 4 missing value in “Children”

hotel_bookings %>% 
  group_by(hotel) %>% 
  summarise(count=n())
## # A tibble: 2 × 2
##   hotel        count
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <chr> "2015-07-01", "2015-07-01", "2015-07-02…

Finding: there are “NULL” values in ‘agent’ & ‘company’

hotel_bookings %>% 
  group_by(hotel) %>% 
  summarise(count=n())
## # A tibble: 2 × 2
##   hotel        count
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
hotel_bookings %>% 
  group_by(arrival_date_year) %>% 
  summarise(count=n())
## # A tibble: 3 × 2
##   arrival_date_year count
##               <int> <int>
## 1              2015 21996
## 2              2016 56707
## 3              2017 40687
hotel_bookings %>% 
  group_by(arrival_date_week_number) %>% 
  summarise(count=n())
## # A tibble: 53 × 2
##    arrival_date_week_number count
##                       <int> <int>
##  1                        1  1047
##  2                        2  1218
##  3                        3  1319
##  4                        4  1487
##  5                        5  1387
##  6                        6  1508
##  7                        7  2109
##  8                        8  2216
##  9                        9  2117
## 10                       10  2149
## # ℹ 43 more rows
hotel_bookings %>% 
  group_by(meal) %>% 
  summarise(count=n())
## # A tibble: 5 × 2
##   meal      count
##   <chr>     <int>
## 1 BB        92310
## 2 FB          798
## 3 HB        14463
## 4 SC        10650
## 5 Undefined  1169
hotel_bookings %>% 
  group_by(country) %>% 
  summarise(count=n())
## # A tibble: 178 × 2
##    country count
##    <chr>   <int>
##  1 ABW         2
##  2 AGO       362
##  3 AIA         1
##  4 ALB        12
##  5 AND         7
##  6 ARE        51
##  7 ARG       214
##  8 ARM         8
##  9 ASM         1
## 10 ATA         2
## # ℹ 168 more rows
hotel_bookings %>% 
  group_by(market_segment) %>% 
  summarise(count=n())
## # A tibble: 8 × 2
##   market_segment count
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## 7 Online TA      56477
## 8 Undefined          2
hotel_bookings %>% 
  group_by(distribution_channel) %>% 
  summarise(count=n())
## # A tibble: 5 × 2
##   distribution_channel count
##   <chr>                <int>
## 1 Corporate             6677
## 2 Direct               14645
## 3 GDS                    193
## 4 TA/TO                97870
## 5 Undefined                5
hotel_bookings %>% 
  group_by(customer_type) %>% 
  summarise(count=n())
## # A tibble: 4 × 2
##   customer_type   count
##   <chr>           <int>
## 1 Contract         4076
## 2 Group             577
## 3 Transient       89613
## 4 Transient-Party 25124
hotel_bookings %>% 
  group_by(arrival_date_month) %>% 
  summarise(count=n())
## # A tibble: 12 × 2
##    arrival_date_month count
##    <chr>              <int>
##  1 April              11089
##  2 August             13877
##  3 December            6780
##  4 February            8068
##  5 January             5929
##  6 July               12661
##  7 June               10939
##  8 March               9794
##  9 May                11791
## 10 November            6794
## 11 October            11160
## 12 September          10508

Data cleaning

We have found 4 “NA” value in children, since there are only 4, it will be fine to delete them

noNA<- hotel_bookings %>% 
  na.omit(hotel_bookings)

Then let’s find out whether there exist some rows which is not logical I check from the ‘summary(hotel)’, “arrival_date_year/month/week_number/day of month” is with their range and thus, let’s check whether there exist areas which may have fault potentially

It is impossible that the sum of the adults, children and babies is zero

acb0 <- 
  subset(noNA, noNA$adults == 0 & noNA$children == 0 & noNA$babies ==0)

There are 180 abnormal observations So we should remove those observations

final<- noNA %>% 
  filter(adults +children +babies !=0)

It is impossible that a check-in customer stay for 0 night.

final %>% 
  mutate(night = stays_in_weekend_nights + stays_in_week_nights) %>% 
  filter(is_canceled ==0 , night == 0) %>% 
  select(night) %>%
  group_by(night) %>% 
  summarise(count=n())
## # A tibble: 1 × 2
##   night count
##   <int> <int>
## 1     0   622

EDA(Exploratory data analysis)

Customer analysis

Customer groups distribution

if we could define there are four type of customer groups(adult only, adults with children, adults with babies, adults with children and babies), what’s the distribution of the two hotel?

customer_group <- final %>% 
  select(hotel, adults,children,babies)

customer_group1 <-customer_group %>% 
  mutate(group= case_when(adults >  0 & children == 0 & babies == 0 ~ 'Adult_only',
                          adults >  0 & children >  0 & babies == 0 ~ 'Adult_children',
                          adults >  0 & children == 0 & babies >  0 ~ 'Adult_babies',
                          adults >  0 & children >  0 & babies >  0 ~ 'Adult_children_babies',
                          adults == 0 & children >  0 & babies == 0 ~ 'Children',
                          adults == 0 & children == 0 & babies >  0 ~ 'Babies',
                          adults == 0 & children >  0 & babies >  0 ~ 'Children_babies'
                          ))

customer_group1 %>% 
  group_by(group) %>% 
  summarise(count=n())
## # A tibble: 6 × 2
##   group                  count
##   <chr>                  <int>
## 1 Adult_babies             742
## 2 Adult_children          8195
## 3 Adult_children_babies    172
## 4 Adult_only            109874
## 5 Children                 220
## 6 Children_babies            3
customer_group2 <- customer_group1 %>% 
  select(hotel,group) 

customer_group2 <- 
  customer_group2 %>% 
  group_by(hotel,group) %>% 
  summarise(count=n())

customer_group_CityHotel <- 
  customer_group2 %>% 
  filter(hotel == "City Hotel")

customer_group1 %>% 
  mutate(group = factor(group, levels=c("Adult_only","Adult_children","Adult_babies","Children","Adult_children_babies", "Children_babies"))) %>% 
  ggplot(aes(x = group,
             y = stat(count),
         fill = factor(hotel),
         label = stat(count)))+
  geom_bar(position = position_dodge())+
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,
            size = 3)+
  labs(title = "Customer groups distribution by hotel", 
       x = "Customer group",
       y = "Count")+                               
  theme(plot.title = element_text(hjust=0.5))+     
  scale_fill_discrete(
    name= "Hotel"
  )          
## Warning: `stat(count)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(customer_group2)
## # A tibble: 10 × 3
## # Groups:   hotel [2]
##    hotel        group                 count
##    <chr>        <chr>                 <int>
##  1 City Hotel   Adult_babies            297
##  2 City Hotel   Adult_children         4814
##  3 City Hotel   Adult_children_babies    69
##  4 City Hotel   Adult_only            73756
##  5 City Hotel   Children                220
##  6 City Hotel   Children_babies           3
##  7 Resort Hotel Adult_babies            445
##  8 Resort Hotel Adult_children         3381
##  9 Resort Hotel Adult_children_babies   103
## 10 Resort Hotel Adult_only            36118

Customer group size

customer_size <- final %>% 
  filter(is_canceled == 0) %>% 
  mutate(groupsize= adults+ children+ babies)


customer_size %>% 
  group_by(hotel,groupsize) %>% 
  summarise(count=n())
## # A tibble: 12 × 3
## # Groups:   hotel [2]
##    hotel        groupsize count
##    <chr>            <int> <int>
##  1 City Hotel           1 10196
##  2 City Hotel           2 29716
##  3 City Hotel           3  4943
##  4 City Hotel           4  1158
##  5 City Hotel           5    69
##  6 City Hotel          10     1
##  7 City Hotel          12     1
##  8 Resort Hotel         1  5830
##  9 Resort Hotel         2 19763
## 10 Resort Hotel         3  2175
## 11 Resort Hotel         4  1125
## 12 Resort Hotel         5    34
customer_size %>% 
  ggplot(aes(x = groupsize,
             y = stat(count),
             fill = factor(hotel),
             label = stat(count)
  ))+
  geom_bar(position = position_dodge())+
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,
            size = 3)+
  theme(axis.title.x = element_text(size = 15))+
  labs(title = "Group size count by hotel", 
       x = "Group size",
       y = "Count")+                               
  theme(plot.title = element_text(hjust=0.5))+     
  scale_fill_discrete(
    name= "Hotel"
  )          

Brief conclusion: main group size is 2.

Customer world distribution of the two hotels

country_customer <- final %>%
  filter(is_canceled==0) %>% 
  select(hotel, country) %>% 
  group_by(hotel,country) %>% 
  summarise(count=n())

Table of visitor source markets count by hotel

country_customer_RH <- country_customer %>% 
  filter(hotel== "Resort Hotel") %>% 
  arrange(desc(count))

country_customer_RH %>% 
  arrange(desc(count))
## # A tibble: 119 × 3
## # Groups:   hotel [1]
##    hotel        country count
##    <chr>        <chr>   <int>
##  1 Resort Hotel PRT     10184
##  2 Resort Hotel GBR      5922
##  3 Resort Hotel ESP      3105
##  4 Resort Hotel IRL      1734
##  5 Resort Hotel FRA      1399
##  6 Resort Hotel DEU      1057
##  7 Resort Hotel CN        614
##  8 Resort Hotel NLD       458
##  9 Resort Hotel NULL      419
## 10 Resort Hotel USA       407
## # ℹ 109 more rows
country_customer_CH <- country_customer %>% 
  filter(hotel== "City Hotel") %>% 
  arrange(desc(count))

country_customer_CH %>% 
  arrange(desc(count))
## # A tibble: 151 × 3
## # Groups:   hotel [1]
##    hotel      country count
##    <chr>      <chr>   <int>
##  1 City Hotel PRT     10793
##  2 City Hotel FRA      7069
##  3 City Hotel DEU      5010
##  4 City Hotel GBR      3746
##  5 City Hotel ESP      3278
##  6 City Hotel ITA      2049
##  7 City Hotel BEL      1479
##  8 City Hotel NLD      1258
##  9 City Hotel USA      1185
## 10 City Hotel BRA      1063
## # ℹ 141 more rows
world_data <- map_data("world")
world_data1 <- ne_countries(scale = 110, returnclass = "sf")


country_customer_RH <- 
  subset(country_customer_RH, select = -hotel)
  
country_customer_RH <- country_customer_RH %>%  
rename_at('country', ~'iso_a3') 

world_customer_RH <- 
  merge(world_data1, country_customer_RH,
         by='iso_a3',
         all.x=TRUE
        )


ggplot(world_customer_RH)+
         geom_sf(aes(fill= count))+
         scale_fill_viridis_c(option = "F", trans= "sqrt")+
  labs(title = "Customer source market of Resort hotel" )+                               
  theme(plot.title = element_text(hjust=0.5))

country_customer_CH <- 
  subset(country_customer_CH, select = -hotel)

country_customer_CH <- country_customer_CH %>%  
  rename_at('country', ~'iso_a3') 

world_customer_CH <- 
  merge( world_data1,country_customer_CH,
        by='iso_a3',
        all.x= TRUE
        )       

ggplot(world_customer_CH)+
  geom_sf(aes(fill= count))+
  scale_fill_viridis_c(option = "F", trans= "sqrt")+
  labs(title = "Customer source market of City hotel",
       size= 5)+                               
  theme(plot.title = element_text(hjust=0.5))

Brief conclusion : Main customer source market is local country (Portugal), top 5 customer source market of Resort hotel is, Portugal, United Kingdom, Spain, Ireland and France, top 5 customer source market of City hotel is, Portugal, France, Germany, United Kingdom and Spain.

According to our top 5 customer source market, we can review our communication(ex. is all the communication/route sign in English and Portuguese?) In what range of cuisines are these customers favorite? Even about our room decoration and ornament, how to cater their demand?

The repeated customer rate of the two hotels

repeated_rate <- final %>% 
  filter(is_canceled == 0) %>% 
  select(hotel, is_repeated_guest)

repeated_rate %>% 
  group_by(hotel, is_repeated_guest) %>% 
  summarise(count=n())
## # A tibble: 4 × 3
## # Groups:   hotel [2]
##   hotel        is_repeated_guest count
##   <chr>                    <int> <int>
## 1 City Hotel                   0 44546
## 2 City Hotel                   1  1538
## 3 Resort Hotel                 0 27260
## 4 Resort Hotel                 1  1667
# The repeated customer rate of City Hotel is 3.3%

1538/(1538+44546)
## [1] 0.03337384
# The repeated customer rate of Resort Hotel is 5.8%

1667/(1667+27260)
## [1] 0.05762782

It’s glad to see that the repeated customer rate is higher than our competitor, it represents that our professional services, room comfort, delicious food, etc. bring our customer visit us repeatedly. We already have a great beginning, let’s consider how to bring more customers back, prepare an experience full of refreshment, novelty for them.

Reservation analysis

ADR (average daily rate) by day

ADR <- final %>% 
  filter(is_canceled == 0) %>% 
  select(hotel,adr, assigned_room_type,reservation_status_date, arrival_date_year, arrival_date_month)

unique(ADR$assigned_room_type)
##  [1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "K"
ADR %>% 
  group_by(hotel,assigned_room_type) %>% 
  summarise(sum(adr))
## # A tibble: 17 × 3
## # Groups:   hotel [2]
##    hotel        assigned_room_type `sum(adr)`
##    <chr>        <chr>                   <dbl>
##  1 City Hotel   A                    2929058.
##  2 City Hotel   B                     140621.
##  3 City Hotel   C                      14575.
##  4 City Hotel   D                    1241157.
##  5 City Hotel   E                     225569.
##  6 City Hotel   F                     222180.
##  7 City Hotel   G                     100780.
##  8 City Hotel   K                      12630.
##  9 Resort Hotel A                     880505.
## 10 Resort Hotel B                      15343.
## 11 Resort Hotel C                     190925.
## 12 Resort Hotel D                     670342.
## 13 Resort Hotel E                     425117.
## 14 Resort Hotel F                     178899.
## 15 Resort Hotel G                     178914.
## 16 Resort Hotel H                      72821.
## 17 Resort Hotel I                      14358.
ADR %>% 
  group_by(hotel,assigned_room_type) %>% 
  summarise(count=n())
## # A tibble: 17 × 3
## # Groups:   hotel [2]
##    hotel        assigned_room_type count
##    <chr>        <chr>              <int>
##  1 City Hotel   A                  30081
##  2 City Hotel   B                   1493
##  3 City Hotel   C                    143
##  4 City Hotel   D                  10698
##  5 City Hotel   E                   1626
##  6 City Hotel   F                   1299
##  7 City Hotel   G                    568
##  8 City Hotel   K                    176
##  9 Resort Hotel A                  10996
## 10 Resort Hotel B                    150
## 11 Resort Hotel C                   1781
## 12 Resort Hotel D                   8249
## 13 Resort Hotel E                   4210
## 14 Resort Hotel F                   1525
## 15 Resort Hotel G                   1201
## 16 Resort Hotel H                    461
## 17 Resort Hotel I                    354
ADR_sum <-  ADR %>% 
  group_by(hotel,reservation_status_date) %>% 
  mutate(Count=(count=n()),Sum=(sum(adr))) %>% 
  select(-assigned_room_type, -adr)
  

ADR_sum <- ADR_sum %>% 
  mutate(total_ADR = Sum/Count) 

ADR_sum <- ADR_sum %>%
  distinct() %>% 
  select(hotel, reservation_status_date, total_ADR,arrival_date_year, arrival_date_month) %>% 
  arrange(reservation_status_date)




ADR_sum$reservation_status_date <- 
  as.Date(ADR_sum$reservation_status_date, "%Y-%m-%d")


ggplot(ADR_sum, aes(x = ADR_sum$reservation_status_date,
                    y = ADR_sum$total_ADR, 
                    group = ADR_sum$hotel,
                    fill= ADR_sum$hotel))+
  geom_line(aes(color = hotel), 
            linewidth= 1)+
  theme_classic()+
  guides(x= guide_axis(angle = 60))+
  scale_x_date(breaks="2 months", date_labels = "%Y-%m")+ 
  labs(title = "ADR by day",
       x= "",
      y= "ADR")+
  theme(plot.title = element_text(hjust=0.5))

Brief conclusion : Resort hotel take more price adjustment than City Hotel.

Thanks to our awesome cooperation to sales team & service team, we are able to optimize the revenue. Our pricing strategy varies depending demand and supply situation.

ADR boxplot

final %>% 
  filter(is_canceled== 0) %>% 
  ggplot(aes(x=reserved_room_type,
             y= adr,
             fill= factor(hotel)))+
  geom_boxplot(position = position_dodge())+
  labs(
    title = "ADR distribution by hotel",
    x= "reserved room type ",
    y= "ADR"
  )+
  scale_fill_discrete(
    name= "Hotel"
  )+
  theme(plot.title = element_text(hjust=0.5))

final %>% 
  filter(is_canceled== 0) %>% 
  ggplot(aes(x=reserved_room_type,
             y= adr,
             fill= factor(hotel)))+
  geom_boxplot(position = position_dodge())+
  labs(
    title = "ADR distribution by hotel & year",
    x= "reserved room type ",
    y= "ADR"
  )+
  scale_fill_discrete(
    name= "Hotel"
  )+
  theme(plot.title = element_text(hjust=0.5))+
  facet_wrap(~arrival_date_year)

Brief conclusion: From the following “Reservation analysis”, we will know that Room Type A is the best-selling room type. The pricing strategy of City Hotel in 2015, is differ from those in 2016 & 2017. In 2015, the mean ADR of Room type “B”, “C” and “G” is lower than “A”, however in 2016 & 2017, the mean ADR of Room type “A” is lower than “C”, “D”, “E”, “F” and “G”.

In 2015, we proceeded a sale technology investigation of other hotels, we found that City Hotel had add-on sales for the offers, for example, combining the restaurant tickets.

In our resort hotel, we have various experience offers, like SPA, entertainment shows, food, art exhibition, etc. Some customers even don’t know actually we can offer these service, thus, we decide to highlight these offer, give a full experience solution to our customers.

In 2016 & 2017, we could see from the plot, the outlier points are much more than 2015, and our competitor–City Hotel, which means that our customer are willing to purchase our add-on service,and last not the least, our sales team execute our new sale mode very well.

However, these outlier points should be token a deeper analysis, to check whether there exists abnormal activity or new customer demand opportunity.

LOS (length of Stay) distribution analysis

final %>% 
  filter(is_canceled == 0) %>% 
  mutate(stay_nights = stays_in_weekend_nights + stays_in_week_nights ) %>% 
  ggplot(aes(x = stay_nights,
             fill = factor(hotel),
             label = stat(count)))+
  geom_bar(stat = "count",
           position = position_dodge(),
           width =0.8 ) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust= - 0.5) +
  scale_fill_discrete(
    name = "Hotel") +
  labs(title = "LOS by hotel(less than 15 nights)", x = "Stay night", y = "Count")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(axis.text.x = element_text(angle = 0, size = 10 ))+
  scale_x_continuous(limits = c(0.5,14.5),breaks = seq(1, 14, 1))
## Warning: Removed 809 rows containing non-finite values (`stat_count()`).
## Removed 809 rows containing non-finite values (`stat_count()`).

final %>% 
  filter(is_canceled == 0) %>% 
  mutate(stay_nights = stays_in_weekend_nights + stays_in_week_nights ) %>% 
  ggplot(aes(x = stay_nights,
             fill = factor(hotel),
             label = stat(count)))+
  geom_bar(stat = "count",
           position = position_dodge(),
           width =0.8 ) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust= - 0.5) +
  scale_fill_discrete(
    name = "Hotel") +
  labs(title = "LOS by hotel(over 15 nights)", x = "Stay night", y = "Count")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(axis.text.x = element_text(angle = 0, size = 10 ))+
  scale_x_continuous(limits = c(14.5,80),breaks = seq(15, 80,5))
## Warning: Removed 74824 rows containing non-finite values (`stat_count()`).
## Warning: Removed 74824 rows containing non-finite values (`stat_count()`).

final %>% 
  filter(is_canceled == 0, hotel == "Resort Hotel", arrival_date_year!= 2015 ) %>% 
  mutate(LOS = stays_in_weekend_nights+stays_in_week_nights) %>% 
  select(LOS) %>% 
  summary(LOS)
##       LOS        
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 3.000  
##  Mean   : 4.074  
##  3rd Qu.: 6.000  
##  Max.   :69.000
final %>% 
  filter(is_canceled == 0, hotel == "City Hotel", arrival_date_year!= 2015 ) %>% 
  mutate(LOS = stays_in_weekend_nights+stays_in_week_nights) %>% 
  select(LOS) %>% 
  summary(LOS)
##       LOS       
##  Min.   : 0.00  
##  1st Qu.: 2.00  
##  Median : 3.00  
##  Mean   : 2.95  
##  3rd Qu.: 4.00  
##  Max.   :48.00

Brief conclusion: Main customers stay in City Hotel within 4 nights; Resort Hotel has a more effective ability to keep customer stay longer than 7 nights.

We also found that there exists a certain amount of customers who have demand of staying more than 7 nights, we also collect some voice from reservations dpmt to prove this demand.

Then we use website, E-mail, outside billboard to propagandize a slogan, “Longer you stay, Less you pay”, and print a special hotline number for calling on the advertisement. We receive numbers of hotline calling and inquiries for medium-term and long-term reservation.

As a result, in 2016 & 2017, our mean of LOS is 1 day more than City Hotel.

Reserved room type distribution by hotel

room_dist <- final %>% 
  select(hotel,reserved_room_type)


room_dist_count <-  room_dist %>% 
  group_by(hotel,reserved_room_type) %>% 
  summarise(count=n())

ggplot(room_dist,
       aes(room_dist$reserved_room_type,           
           fill= factor(hotel),                    
           label= (stat(count))))+                 
  geom_bar(position = position_dodge())+
  geom_text(stat = "count", 
            position = position_dodge(.9),
            vjust = -0.5,
            size= 2.5)+                            
  labs(title = "The distribution of the reserved room type by hotel", 
       x = "Reserved room type",
       y = "Count")+                               
  theme(plot.title = element_text(hjust=0.5))+     
  scale_fill_discrete(
    name= "Hotel"
  )                                               

Brief conclusion: Type A is the most famous room type of the two hotel.

Reservation status by month

Let’s take an overall look on the reservation status of both hotels

final %>% 
  mutate(arrival_date_month = factor(
    arrival_date_month,
    levels = c(
      "January",
      "February",
      "March",
      "April",
      "May",
      "June",
      "July",
      "August",
      "September",
      "October",
      "November",
      "December"
    ),
    ordered = TRUE
  )) %>% 
  ggplot(aes(arrival_date_month, 
             fill = factor(is_canceled),
              label = stat(count))) +
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,  
            size= 2) +
  scale_fill_discrete(
    name = "Reservation Status",
    breaks = c("0", "1"),
    label = c("Not Cancelled", "Cancelled")) +
  facet_wrap(~arrival_date_year)+
  labs(title = "Reservation Status by Month", x = "Month", y = "Count") +
  scale_x_discrete(labels = month.abb)+                               
  theme(plot.title = element_text(hjust=0.5))

Cancel rate by hotel & month

can_rate_mon <- final %>% 
  select(is_canceled, hotel, arrival_date_year,arrival_date_month) %>% 
  mutate(arrival_date_month = factor(
    arrival_date_month,
    levels = c(
      "January",
      "February",
      "March",
      "April",
      "May",
      "June",
      "July",
      "August",
      "September",
      "October",
      "November",
      "December"
    ),
    ordered = TRUE
  ))



can_rate_mon_true <- can_rate_mon %>% 
  filter(is_canceled == 1)

abc <-  can_rate_mon_true %>% 
  group_by(hotel, arrival_date_year, arrival_date_month) %>% 
  summarise(cancel=n())

abcd <- can_rate_mon %>% 
  group_by(hotel, arrival_date_year, arrival_date_month) %>% 
  summarise(total=n())

can_rate_mon1 <- merge(abc,abcd,
                       c("hotel", "arrival_date_year","arrival_date_month")
                       )


can_rate_mon1 <-  can_rate_mon1 %>% 
  mutate(can_rate =  percent(round(cancel/total, 3)))

rm(abc)
rm(abcd)


can_rate_mon1 %>% 
  ggplot(aes(x= arrival_date_month,
             y= can_rate,
             fill = hotel,
             )) +
 geom_bar(stat= "identity",position = position_dodge())+
  facet_wrap(~arrival_date_year)+
  guides(x= guide_axis(angle = 60)) +
  geom_text(aes(label = can_rate),
            vjust = -0.5,
            position = position_dodge(.9)
            )+
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())+
  labs(title = "Cancel rate by hotel & month",
       x= "Month",
       y= "Cancel Rate")+                               
  theme(plot.title = element_text(hjust=0.5))

Brief conclusion: City hotel has a higher cancel rate than Resort hotel until July, 2017.

Recently our cancel rate is raising rapidly, we should keep an eye on it.

Is there any competitive hotel emerging around us?

Can we confirm our service satisfactory?

Does customer purchasing behavior change? Is our deposit type suitable for a better business mode?

We need find out the reason and take action to decrease the cancel rate.

Lead time vs. Cancellation

final %>% 
  filter(hotel == "City Hotel") %>% 
  ggplot(aes(x=lead_time,
             fill= factor(is_canceled),
             label = stat(count)))+
  geom_density(alpha=0.5)+
  labs(title="Lead time vs cancellation(City Hotel)",
       x="Lead Time",
       y="Density",
       )+
  scale_fill_discrete(
    name = "Cancelled?",
    breaks = c("0", "1"),
    label = c("Not Cancelled", "Cancelled")
  )

final %>% 
  filter(hotel == "Resort Hotel") %>% 
  ggplot(aes(x=lead_time,
             fill= factor(is_canceled),
             label = stat(count)))+
  geom_density(alpha=0.5)+
  labs(title="Lead time vs cancellation(Resort Hotel)",
       x="Lead Time",
       y="Density",
       )+
  scale_fill_discrete(
    name = "Cancelled?",
    breaks = c("0", "1"),
    label = c("Not Cancelled", "Cancelled")
  )

Brief conclusion:

Both hotel share a similar trend. The lead time decreases, the probability of confirmation is increase; the lead time increases, the likelihood of cancellation increases.

Customer type vs Cancellation

Pie6 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Total Bookings Cancelations(City Hotel)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Blues")


Pie7 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Transient"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Transient",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie8 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Contract"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Contract",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie9 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Transient-Party"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Transient-Party",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie10 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Group" ) %>% 
  group_by(booking_cancelations) %>%
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Group",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")
  

grid.arrange(Pie6, Pie7, Pie9, Pie8, Pie10, ncol = 3)

Pie1 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Total Bookings Cancelations(Resort Hotel)",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Blues")


Pie2 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Transient"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Transient",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie3 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Contract"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Contract",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie4 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Transient-Party"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Transient-Party",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie5 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(customer_type == "Group"  ) %>% 
  group_by(booking_cancelations) %>%
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Group",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")
  
  

grid.arrange(Pie1,  Pie2, Pie4, Pie3, Pie5, ncol = 3)

Brief conclusion : In terms of customer type, our cancellation rate performs better than City Hotel, especially in Transient aspect.

               However, there also exists potential for improvement, let's consider the nest step to further decrease the cancellation rate.

Deposit type vs Cancellation

Pie11 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Total Bookings Cancelations(City Hotel)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Blues")


Pie12 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "No Deposit"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "No Deposit",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")
  

Pie13 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "Refundable"  ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Refundable",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie14 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "Non Refund"  ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "City Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Non Refund",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

  

grid.arrange(Pie11, Pie12, Pie14, Pie13, ncol = 2)

Pie15 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Total Bookings Cancelations(Resort Hotel)",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Blues")

Pie16 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "No Deposit"  ) %>% 
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "No Deposit",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")


Pie17 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "Refundable"  ) %>%
  group_by(booking_cancelations) %>% 
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Refundable") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")

Pie18 <- final %>% 
  mutate(
        booking_cancelations = ifelse(is_canceled == 0, "Not Canceled", "Canceled")
        ) %>%
  filter(deposit_type == "Non Refund"  ) %>% 
  group_by(booking_cancelations) %>%
  filter(hotel == "Resort Hotel") %>% 
  summarise(booking_count= n()) %>% 
  mutate(Percent = round(booking_count / sum(booking_count), 3)) %>%
  ggplot(aes(x = "", y = booking_count, fill = booking_cancelations)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  geom_text(aes(label = paste0(Percent*100, "%")), 
              position = position_stack(vjust=.5), 
              size = 5)  +
  labs(title = "Non Refund",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Greens")
  

grid.arrange(Pie15, Pie16, Pie18, Pie17, ncol = 2)

Brief conclusion: It will be interesting that the cancellation in the “Non refund” aspect, both hotel reach up to more than 95%. Which means that, although we add a limitation to urge the customers to show up, this range of customer will always cancel the room in the end.

First, we need to dive in to find out the root reason. Which market segment do they belong? The customer has other schedule arrangement, or they rather choose a hotel besides us?

We will also review our overbook strategy with Reservation, in order to optimize the revenue.

Logistic regression

Data pre-process (transform the character values into factors)

In order to execute a regression, the character values should be transformed into factors.

str(final)
## 'data.frame':    119206 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
##   ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"
final_regression <-
  as.data.frame(unclass(final), stringsAsFactors = TRUE)

str(final_regression)  
## 'data.frame':    119206 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 7 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 9 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 11 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 349 levels "10","100","101",..: 349 349 349 349 349 349 349 349 349 349 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
summary(final_regression)
##           hotel        is_canceled       lead_time     arrival_date_year
##  City Hotel  :79159   Min.   :0.0000   Min.   :  0.0   Min.   :2015     
##  Resort Hotel:40047   1st Qu.:0.0000   1st Qu.: 18.0   1st Qu.:2016     
##                       Median :0.0000   Median : 69.0   Median :2016     
##                       Mean   :0.3707   Mean   :104.1   Mean   :2016     
##                       3rd Qu.:1.0000   3rd Qu.:161.0   3rd Qu.:2017     
##                       Max.   :1.0000   Max.   :737.0   Max.   :2017     
##                                                                         
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13857      Min.   : 1.00            Min.   : 1.0             
##  July   :12644      1st Qu.:16.00            1st Qu.: 8.0             
##  May    :11780      Median :28.00            Median :16.0             
##  October:11147      Mean   :27.16            Mean   :15.8             
##  April  :11078      3rd Qu.:38.00            3rd Qu.:23.0             
##  June   :10929      Max.   :53.00            Max.   :31.0             
##  (Other):47771                                                        
##  stays_in_weekend_nights stays_in_week_nights     adults          children     
##  Min.   : 0.0000         Min.   : 0.000       Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.000       1st Qu.: 2.000   1st Qu.: 0.000  
##  Median : 1.0000         Median : 2.000       Median : 2.000   Median : 0.000  
##  Mean   : 0.9271         Mean   : 2.499       Mean   : 1.859   Mean   : 0.104  
##  3rd Qu.: 2.0000         3rd Qu.: 3.000       3rd Qu.: 2.000   3rd Qu.: 0.000  
##  Max.   :19.0000         Max.   :50.000       Max.   :55.000   Max.   :10.000  
##                                                                                
##      babies                 meal          country            market_segment 
##  Min.   : 0.000000   BB       :92232   PRT    :48479   Aviation     :  235  
##  1st Qu.: 0.000000   FB       :  798   GBR    :12120   Complementary:  728  
##  Median : 0.000000   HB       :14458   FRA    :10401   Corporate    : 5282  
##  Mean   : 0.007961   SC       :10549   ESP    : 8560   Direct       :12581  
##  3rd Qu.: 0.000000   Undefined: 1169   DEU    : 7285   Groups       :19791  
##  Max.   :10.000000                     ITA    : 3761   Offline TA/TO:24182  
##                                        (Other):28600   Online TA    :56407  
##  distribution_channel is_repeated_guest previous_cancellations
##  Corporate: 6651      Min.   :0.0000    Min.   : 0.00000      
##  Direct   :14611      1st Qu.:0.0000    1st Qu.: 0.00000      
##  GDS      :  193      Median :0.0000    Median : 0.00000      
##  TA/TO    :97750      Mean   :0.0315    Mean   : 0.08719      
##  Undefined:    1      3rd Qu.:0.0000    3rd Qu.: 0.00000      
##                       Max.   :1.0000    Max.   :26.00000      
##                                                               
##  previous_bookings_not_canceled reserved_room_type assigned_room_type
##  Min.   : 0.0000                A      :85873      A      :74020     
##  1st Qu.: 0.0000                D      :19179      D      :25309     
##  Median : 0.0000                E      : 6519      E      : 7798     
##  Mean   : 0.1371                F      : 2894      F      : 3751     
##  3rd Qu.: 0.0000                G      : 2092      G      : 2549     
##  Max.   :72.0000                B      : 1111      C      : 2370     
##                                 (Other): 1538      (Other): 3409     
##  booking_changes       deposit_type        agent          company      
##  Min.   : 0.0000   No Deposit:104457   9      :31921   NULL   :112438  
##  1st Qu.: 0.0000   Non Refund: 14587   NULL   :16278   40     :   924  
##  Median : 0.0000   Refundable:   162   240    :13922   223    :   784  
##  Mean   : 0.2188                       1      : 7187   67     :   267  
##  3rd Qu.: 0.0000                       14     : 3632   45     :   249  
##  Max.   :18.0000                       7      : 3532   153    :   213  
##                                        (Other):42734   (Other):  4331  
##  days_in_waiting_list         customer_type        adr         
##  Min.   :  0.000      Contract       : 4072   Min.   :  -6.38  
##  1st Qu.:  0.000      Group          :  574   1st Qu.:  69.50  
##  Median :  0.000      Transient      :89476   Median :  94.95  
##  Mean   :  2.321      Transient-Party:25084   Mean   : 101.97  
##  3rd Qu.:  0.000                              3rd Qu.: 126.00  
##  Max.   :391.000                              Max.   :5400.00  
##                                                                
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Canceled :42989   
##  1st Qu.:0.00000             1st Qu.:0.0000            Check-Out:75011   
##  Median :0.00000             Median :0.0000            No-Show  : 1206   
##  Mean   :0.06256             Mean   :0.5715                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  2015-10-21:  1460      
##  2015-07-06:   805      
##  2016-11-25:   790      
##  2015-01-01:   763      
##  2016-01-18:   625      
##  2015-07-02:   469      
##  (Other)   :114294
unique(final_regression$country)
##   [1] PRT  GBR  USA  ESP  IRL  FRA  NULL ROU  NOR  OMN  ARG  POL  DEU  BEL  CHE 
##  [16] CN   GRC  ITA  NLD  DNK  RUS  SWE  AUS  EST  CZE  BRA  FIN  MOZ  BWA  LUX 
##  [31] SVN  ALB  IND  CHN  MEX  MAR  UKR  SMR  LVA  PRI  SRB  CHL  AUT  BLR  LTU 
##  [46] TUR  ZAF  AGO  ISR  CYM  ZMB  CPV  ZWE  DZA  KOR  CRI  HUN  ARE  TUN  JAM 
##  [61] HRV  HKG  IRN  GEO  AND  GIB  URY  JEY  CAF  CYP  COL  GGY  KWT  NGA  MDV 
##  [76] VEN  SVK  FJI  KAZ  PAK  IDN  LBN  PHL  SEN  SYC  AZE  BHR  NZL  THA  DOM 
##  [91] MKD  MYS  ARM  JPN  LKA  CUB  CMR  BIH  MUS  COM  SUR  UGA  BGR  CIV  JOR 
## [106] SYR  SGP  BDI  SAU  VNM  PLW  QAT  EGY  PER  MLT  MWI  ECU  MDG  ISL  UZB 
## [121] NPL  BHS  MAC  TGO  TWN  DJI  STP  KNA  ETH  IRQ  HND  RWA  KHM  MCO  BGD 
## [136] IMN  TJK  NIC  BEN  VGB  TZA  GAB  GHA  TMP  GLP  KEN  LIE  GNB  MNE  UMI 
## [151] MYT  FRO  MMR  PAN  BFA  LBY  MLI  NAM  BOL  PRY  BRB  ABW  AIA  SLV  DMA 
## [166] PYF  GUY  LCA  ATA  GTM  ASM  MRT  NCL  KIR  SDN  ATF  SLE  LAO 
## 178 Levels: ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT AZE BDI ... ZWE

before proceeding a correlation analysis, we should transfer the factor, the null value in to numeric value.

unique(final_regression$arrival_date_month)
##  [1] July      August    September October   November  December  January  
##  [8] February  March     April     May       June     
## 12 Levels: April August December February January July June March ... September
c <- subset(final_regression,select = -reservation_status_date)

dmy <- dummyVars("~.", data= c, fullRank = T)
dat_trans <- data.frame(predict(dmy,newdata = c))


cor_final <- cor(dat_trans, method = "pearson" )

Correlation and P-value

cor_cancel <- abs(data.frame(subset(cor_final, select = is_canceled)) )

cor_cancel %>% 
  arrange(desc(is_canceled)) %>% 
  slice(1:30)
##                               is_canceled
## is_canceled                    1.00000000
## reservation_status.Check.Out   1.00000000
## deposit_type.Non.Refund        0.48153843
## country.PRT                    0.33690827
## lead_time                      0.29292975
## total_of_special_requests      0.23492516
## market_segment.Groups          0.22201637
## required_car_parking_spaces    0.19569608
## distribution_channel.TA.TO     0.17611720
## market_segment.Direct          0.15439659
## distribution_channel.Direct    0.15157094
## booking_changes                0.14482099
## hotel.Resort.Hotel             0.13705612
## customer_type.Transient        0.13332108
## reservation_status.No.Show     0.13170691
## assigned_room_type.D           0.12831050
## customer_type.Transient.Party  0.12436659
## country.FRA                    0.11835664
## country.GBR                    0.11731814
## previous_cancellations         0.11014693
## country.DEU                    0.10751850
## agent.NULL                     0.10227261
## company.NULL                   0.09968412
## agent.7                        0.08557600
## is_repeated_guest              0.08374000
## market_segment.Corporate       0.08163788
## agent.28                       0.07481489
## agent.19                       0.07108178
## agent.14                       0.07028640
## agent.29                       0.06750667

We found the correlation value between is_canceled and other every single value, however, it is important to check the p-value of them.

cor_matrix <- rcorr(as.matrix(dat_trans), type= "pearson")

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
  )
}

d <- flattenCorrMatrix(cor_matrix$r, cor_matrix$P) %>% 
  filter(row=='is_canceled') 

d <- d %>% 
  filter(p<0.01)


d_abs <- as.data.frame(d)
d_abs <- data.frame(d$row, d$column,abs(d$cor),d$p)
colnames(d_abs) <- c("row", "column", "cor", "P")
d_abs %>% 
  arrange(desc(d_abs$cor)) %>% 
  slice(1:25)
##            row                        column        cor P
## 1  is_canceled  reservation_status.Check.Out 1.00000000 0
## 2  is_canceled       deposit_type.Non.Refund 0.48153843 0
## 3  is_canceled                   country.PRT 0.33690827 0
## 4  is_canceled                     lead_time 0.29292975 0
## 5  is_canceled     total_of_special_requests 0.23492516 0
## 6  is_canceled         market_segment.Groups 0.22201637 0
## 7  is_canceled   required_car_parking_spaces 0.19569608 0
## 8  is_canceled    distribution_channel.TA.TO 0.17611720 0
## 9  is_canceled         market_segment.Direct 0.15439659 0
## 10 is_canceled   distribution_channel.Direct 0.15157094 0
## 11 is_canceled               booking_changes 0.14482099 0
## 12 is_canceled       customer_type.Transient 0.13332108 0
## 13 is_canceled    reservation_status.No.Show 0.13170691 0
## 14 is_canceled          assigned_room_type.D 0.12831050 0
## 15 is_canceled customer_type.Transient.Party 0.12436659 0
## 16 is_canceled                   country.FRA 0.11835664 0
## 17 is_canceled                   country.GBR 0.11731814 0
## 18 is_canceled        previous_cancellations 0.11014693 0
## 19 is_canceled                   country.DEU 0.10751850 0
## 20 is_canceled                    agent.NULL 0.10227261 0
## 21 is_canceled                  company.NULL 0.09968412 0
## 22 is_canceled                       agent.7 0.08557600 0
## 23 is_canceled             is_repeated_guest 0.08374000 0
## 24 is_canceled      market_segment.Corporate 0.08163788 0
## 25 is_canceled                      agent.28 0.07481489 0

Brief summary:

Regarding to the reservation_status.Check.Out, since it is rare to encounter a scene that customer to cancel the room after the customer checked out, although it has high correlation, but we should exclude it for the future modelling.

And we will select the cor > 0.1 to analyse, which is deposit_type, country,lead_time,total_of_special_requests, market_segement,distribution_channel, booking_changes, customer_type, reservation_status, assigned_room_type, previous_cancellations and agent.

We will split the 70% of the data set for training, 30% for testing.

model_prepare <- final_regression %>% 
  select(is_canceled,deposit_type,country,lead_time, 
         total_of_special_requests, market_segment,
         distribution_channel, booking_changes,
         customer_type,reservation_status, 
         assigned_room_type,previous_cancellations,agent)



model_encoding <- data.frame (is_canceled=model_prepare$is_canceled,
                              deposit_type=as.numeric(factor(model_prepare$deposit_type)),
                              country=as.numeric(factor(model_prepare$country)),
                              lead_time=model_prepare$lead_time,
                              total_of_special_requests=model_prepare$total_of_special_requests,
                              market_segement=as.numeric(factor(model_prepare$market_segment)),
                              distribution_channel=as.numeric(factor(model_prepare$distribution_channel)),
                              booking_changes=model_prepare$booking_changes,
                              customer_type=as.numeric(factor(model_prepare$customer_type)),
                              reservation_status=as.numeric(factor(model_prepare$reservation_status)),
                              assigned_room_type=as.numeric(factor(model_prepare$assigned_room_type)),
                              previous_cancellations=model_prepare$previous_cancellations,
                              agent=as.numeric(factor(model_prepare$agent))
                              )

str(model_encoding)
## 'data.frame':    119206 obs. of  13 variables:
##  $ is_canceled              : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ deposit_type             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ country                  : num  137 137 60 60 60 60 137 137 137 137 ...
##  $ lead_time                : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ total_of_special_requests: int  0 0 0 0 1 1 0 1 1 0 ...
##  $ market_segement          : num  4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel     : num  2 2 2 1 4 4 2 2 4 4 ...
##  $ booking_changes          : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ customer_type            : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ reservation_status       : num  2 2 2 2 2 2 2 2 1 1 ...
##  $ assigned_room_type       : num  3 3 3 1 1 1 3 3 1 4 ...
##  $ previous_cancellations   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ agent                    : num  334 334 334 157 103 103 334 156 103 40 ...
summary(is.na(model_encoding))
##  is_canceled     deposit_type     country        lead_time      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:119206    FALSE:119206    FALSE:119206    FALSE:119206   
##  total_of_special_requests market_segement distribution_channel booking_changes
##  Mode :logical             Mode :logical   Mode :logical        Mode :logical  
##  FALSE:119206              FALSE:119206    FALSE:119206         FALSE:119206   
##  customer_type   reservation_status assigned_room_type previous_cancellations
##  Mode :logical   Mode :logical      Mode :logical      Mode :logical         
##  FALSE:119206    FALSE:119206       FALSE:119206       FALSE:119206          
##    agent        
##  Mode :logical  
##  FALSE:119206
set.seed(777)
data_seperate <- sample(2,size = nrow(model_encoding), replace = T, prob = c(0.7, 0.3))

train_data <- model_encoding[data_seperate,]
test_data <-  model_encoding[!data_seperate,]

data_seperate = floor(0.7* nrow(model_encoding))


pick = sample(seq_len(nrow(model_encoding)), size = data_seperate) 
train_data = model_encoding[pick,]
test_data = model_encoding[-pick,]

Logistic regression building and examination

glm.fit <- glm(is_canceled ~ deposit_type + country +lead_time + total_of_special_requests + market_segement +
                             distribution_channel + booking_changes + customer_type + reservation_status + 
                             assigned_room_type + previous_cancellations + agent ,
               data = train_data, family = "binomial")

summary(glm.fit)
## 
## Call:
## glm(formula = is_canceled ~ deposit_type + country + lead_time + 
##     total_of_special_requests + market_segement + distribution_channel + 
##     booking_changes + customer_type + reservation_status + assigned_room_type + 
##     previous_cancellations + agent, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3098  -0.2544  -0.1959   0.1366   4.7806  
## 
## Coefficients:
##                             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                6.4697679  0.2291163   28.238  < 2e-16 ***
## deposit_type               1.5723409  0.0887952   17.708  < 2e-16 ***
## country                    0.0089999  0.0004702   19.142  < 2e-16 ***
## lead_time                  0.0013160  0.0002318    5.676 1.38e-08 ***
## total_of_special_requests -0.3216212  0.0277813  -11.577  < 2e-16 ***
## market_segement            0.2464650  0.0316410    7.789 6.73e-15 ***
## distribution_channel      -0.0663435  0.0426308   -1.556 0.119653    
## booking_changes           -0.1598362  0.0317858   -5.029 4.94e-07 ***
## customer_type             -0.1665301  0.0362803   -4.590 4.43e-06 ***
## reservation_status        -6.4293069  0.0439512 -146.283  < 2e-16 ***
## assigned_room_type        -0.0379678  0.0110435   -3.438 0.000586 ***
## previous_cancellations     0.1465801  0.0461994    3.173 0.001510 ** 
## agent                      0.0003795  0.0001836    2.067 0.038739 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 109875  on 83443  degrees of freedom
## Residual deviance:  21663  on 83431  degrees of freedom
## AIC: 21689
## 
## Number of Fisher Scoring iterations: 7
glm_predic <- predict(glm.fit,newdata = test_data, type = "response")

train_data$pred <- round(predict(glm.fit, newdata = train_data,type = "response"))
test_data$pred <- round(predict(glm.fit, newdata = test_data,type = "response"))
model_encoding$pred <- round(predict(glm.fit, newdata = model_encoding,type = "response"))

acc_reg_train <- sum(train_data$pred== train_data$is_canceled)/nrow(train_data)


acc_reg_test <- sum(test_data$pred== test_data$is_canceled)/nrow(test_data)

train data examination

acc_reg_train
## [1] 0.9895619

Regression model accuracy

test data examination

acc_reg_test
## [1] 0.9899334

Random Forest

Data preprocess

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf.train.data <- train_data%>% 
  select(-reservation_status,-pred)
rf.train.data$is_canceled <- as.factor(rf.train.data$is_canceled)


rf.fit <- randomForest(is_canceled~., data= rf.train.data, na.action = na.roughfix, importance = T)
rf.fit
## 
## Call:
##  randomForest(formula = is_canceled ~ ., data = rf.train.data,      importance = T, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.58%
## Confusion matrix:
##       0     1 class.error
## 0 48107  4553  0.08646031
## 1  7616 23168  0.24740125
rf.train.error <- as.data.frame(plot(rf.fit, type = "l"))

colnames(rf.train.error) <- paste('error', colnames(rf.train.error), sep = '')
rf.train.error$ntree <- 1:nrow(rf.train.error)
rf.train.error <- gather(rf.train.error, key = "Type",  value = 'Error', 1:3)

When the ntrees are around the 50, the error output tends to be stable.

ggplot(rf.train.error, 
       aes(x= ntree,
           y = Error))+
  geom_line(aes(linetype = Type, color = Type))

Take a look at the importance of the variables

importance(rf.fit, type = 1)
##                           MeanDecreaseAccuracy
## deposit_type                          60.58883
## country                              104.31665
## lead_time                            148.07801
## total_of_special_requests            100.16979
## market_segement                       81.71621
## distribution_channel                  45.16274
## booking_changes                       87.99819
## customer_type                         99.58494
## assigned_room_type                    56.53682
## previous_cancellations                76.59571
## agent                                 80.45816
importance(rf.fit, type = 2)
##                           MeanDecreaseGini
## deposit_type                     5148.9395
## country                          3608.6899
## lead_time                        3924.2226
## total_of_special_requests        1976.6258
## market_segement                  1973.2671
## distribution_channel              456.9512
## booking_changes                   740.9956
## customer_type                     942.8197
## assigned_room_type                827.8901
## previous_cancellations           1297.7080
## agent                            2374.8643
varImpPlot(rf.fit)+
  title("")

## numeric(0)
rf.test.data <- test_data%>% 
  select(-reservation_status, -pred)
rf.test.data$is_canceled <- as.factor(rf.test.data$is_canceled)
rf.test.pred <- predict(rf.fit, rf.test.data)

rf.test.data$pred <- predict(rf.fit, newdata = test_data,type = "response")

acc_rf_test <- sum(rf.test.data$pred== rf.test.data$is_canceled)/nrow(rf.test.data)

Random forest accuracy

acc_rf_test
## [1] 0.8530843