Travel is an enlightening and eye-opening experience for both adults and children of all ages. However, traveling can also be overwhelming due to unpredictable schedules, jet lag, where to stay et, cl. Taking staying for example, how would people that book the same hotels repeatedly behave? How likely Would they cancel a booking? What kind of meal plan would they choose? How many days ahead of checking in would they book a hotel? In this project, I am going to use a real data to answer these questions and more.
The original data for this project is from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019. The original data was then cleaned promarily by some data scientist. The source data can be found at tidytuesday. This dataset includes 32 variables, recording the hotel and stay information such as hotel type, arrivel time, number of people, what meal plan they choose and so on.
Explore the behavior the repeated guests by mining the relationship between variable “is_repeated_guest” and several other variables in this dataset. Display the visualization as well as statistics of the results. Dig the insights from the results
The project aims to gain interesting insight into customers’ behavior when booking a hotel. These insights can guide hotels to adjust their customer strategies and make preparation for unknown.
The following packages are needed to import and manipulate the data.
library(data.table) #import data
library(tidyverse) #tidy data, visualisation, transformation
library(tibble) #create tibbles
library(plotrix) #3D Exploded Pie Chart
library(tidyr) #tidy data
library(dplyr) #data analysis
library(DT) #display data set
library(ggplot2) #data visualization
library(magrittr) #pipe oprator
library(knitr) #display tables
First, I download the data set from website tidytuesday, and then read it into R.
hotels <- fread("E:/Courses/20Spring/Data wrangling/Final/hotels/hotels.csv")
This data set contains 119390 observations and 32 variables. Some of the variables are interger, others are charactor. It includes information about hotel names, lead time, number of children, agent, and much more.
str(hotels)
## Classes 'data.table' and 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
## - attr(*, ".internal.selfref")=<externalptr>
By checking missing values for each variables, we can see that there are only 4 missing value for variable “children”, no missing value in other variables.
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
There are four missing values in the variable “children”. They are imputed by the median value of this variable. Considering the big sample size, this approach would be appropriate.
hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)
The “Undefined” meal are imputed by “SC” because both “Undefined” and “SC” mean the customer choose not to eat at hotel.
hotels$meal <- as.factor(hotels$meal)
levels(hotels$meal) <- c("BB", "FB", "HB", "SC", "SC")
The variable “is_repeated_guest” is a binary, so it should be a factor. Same change needed for the variable “is_canceled”.
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)
From the statistic summary, we can find some interesting information. For instance, about 37% of bookings have been canceled. The booking information is from year 2015 to 2017. Half of the bookings for weekdays have customers stay for two nights. 75% of the bookings have customers stay for three nignts. The longest stay is 50 nights. The average previous cancellation is less than one. 75% of the bookings have one special request. 3% of the bookings are made by repeated guests.
summary(hotels)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 0:75166 Min. : 0 Min. :2015
## Class :character 1:44224 1st Qu.: 18 1st Qu.:2016
## Mode :character Median : 69 Median :2016
## Mean :104 Mean :2016
## 3rd Qu.:160 3rd Qu.:2017
## Max. :737 Max. :2017
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB:92310 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB: 798 Class :character
## Median : 0.0000 Median : 0.000000 HB:14463 Mode :character
## Mean : 0.1039 Mean : 0.007949 SC:11819
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 0:115580
## Class :character Class :character 1: 3810
## Mode :character Mode :character
##
##
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
## assigned_room_type booking_changes deposit_type
## Length:119390 Min. : 0.0000 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character
## Mode :character Median : 0.0000 Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
## agent company days_in_waiting_list
## Length:119390 Length:119390 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.000
## Mean : 2.321
## 3rd Qu.: 0.000
## Max. :391.000
## customer_type adr required_car_parking_spaces
## Length:119390 Min. : -6.38 Min. :0.00000
## Class :character 1st Qu.: 69.29 1st Qu.:0.00000
## Mode :character Median : 94.58 Median :0.00000
## Mean : 101.83 Mean :0.06252
## 3rd Qu.: 126.00 3rd Qu.:0.00000
## Max. :5400.00 Max. :8.00000
## total_of_special_requests reservation_status reservation_status_date
## Min. :0.0000 Length:119390 Length:119390
## 1st Qu.:0.0000 Class :character Class :character
## Median :0.0000 Mode :character Mode :character
## Mean :0.5714
## 3rd Qu.:1.0000
## Max. :5.0000
The data description is shown below to make a reference whenever needed.
hotels.type <- lapply(hotels, class)
hotels.var_desc <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)",
"Value indicating if the booking was canceled (1) or not (0)",
"Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
"Year of arrival date",
"Month of arrival date",
"Week number of year for arrival date",
"Day of arrival date",
"Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
"Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
"Number of adults",
"Number of children",
"Number of babies",
"Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)",
"Country of origin. Categories are represented in the ISO 3155–3:2013 format",
"Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
"Value indicating if the booking name was from a repeated guest (1) or not (0)",
"Number of previous bookings that were cancelled by the customer prior to the current booking",
"Number of previous bookings not cancelled by the customer prior to the current booking",
"Code of room type reserved. Code is presented instead of designation for anonymity reasons",
"Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
"Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
"Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
"ID of the travel agency that made the booking",
"ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
"Number of days the booking was in the waiting list before it was confirmed to the customer",
"Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
"Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights",
"Number of car parking spaces required by the customer",
"Number of special requests made by the customer (e.g. twin bed or high floor)",
"Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
"Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel"
)
hotels.var_names <- colnames(hotels)
data.description <- as_data_frame(cbind(hotels.var_names, hotels.type, hotels.var_desc))
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description)
| Variable name | Data Type | Variable Description |
|---|---|---|
| hotel | character | Hotel (H1 = Resort Hotel or H2 = City Hotel) |
| is_canceled | factor | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | integer | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | integer | Year of arrival date |
| arrival_date_month | character | Month of arrival date |
| arrival_date_week_number | integer | Week number of year for arrival date |
| arrival_date_day_of_month | integer | Day of arrival date |
| stays_in_weekend_nights | integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | integer | Number of adults |
| children | numeric | Number of children |
| babies | integer | Number of babies |
| meal | factor | Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner) |
| country | character | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
| market_segment | character | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | character | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | factor | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | integer | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | integer | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | character | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| assigned_room_type | character | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
| booking_changes | integer | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | character | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
| agent | character | ID of the travel agency that made the booking |
| company | character | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| days_in_waiting_list | integer | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | character | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
| adr | numeric | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | integer | Number of car parking spaces required by the customer |
| total_of_special_requests | integer | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | character | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
| reservation_status_date | character | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
The dataset look clean and tidy now.
hotels %>%
head(n = 20)
## hotel is_canceled lead_time arrival_date_year
## 1: Resort Hotel 0 342 2015
## 2: Resort Hotel 0 737 2015
## 3: Resort Hotel 0 7 2015
## 4: Resort Hotel 0 13 2015
## 5: Resort Hotel 0 14 2015
## 6: Resort Hotel 0 14 2015
## 7: Resort Hotel 0 0 2015
## 8: Resort Hotel 0 9 2015
## 9: Resort Hotel 1 85 2015
## 10: Resort Hotel 1 75 2015
## 11: Resort Hotel 1 23 2015
## 12: Resort Hotel 0 35 2015
## 13: Resort Hotel 0 68 2015
## 14: Resort Hotel 0 18 2015
## 15: Resort Hotel 0 37 2015
## 16: Resort Hotel 0 68 2015
## 17: Resort Hotel 0 37 2015
## 18: Resort Hotel 0 12 2015
## 19: Resort Hotel 0 0 2015
## 20: Resort Hotel 0 7 2015
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## 1: July 27 1
## 2: July 27 1
## 3: July 27 1
## 4: July 27 1
## 5: July 27 1
## 6: July 27 1
## 7: July 27 1
## 8: July 27 1
## 9: July 27 1
## 10: July 27 1
## 11: July 27 1
## 12: July 27 1
## 13: July 27 1
## 14: July 27 1
## 15: July 27 1
## 16: July 27 1
## 17: July 27 1
## 18: July 27 1
## 19: July 27 1
## 20: July 27 1
## stays_in_weekend_nights stays_in_week_nights adults children babies
## 1: 0 0 2 0 0
## 2: 0 0 2 0 0
## 3: 0 1 1 0 0
## 4: 0 1 1 0 0
## 5: 0 2 2 0 0
## 6: 0 2 2 0 0
## 7: 0 2 2 0 0
## 8: 0 2 2 0 0
## 9: 0 3 2 0 0
## 10: 0 3 2 0 0
## 11: 0 4 2 0 0
## 12: 0 4 2 0 0
## 13: 0 4 2 0 0
## 14: 0 4 2 1 0
## 15: 0 4 2 0 0
## 16: 0 4 2 0 0
## 17: 0 4 2 0 0
## 18: 0 1 2 0 0
## 19: 0 1 2 0 0
## 20: 0 4 2 0 0
## meal country market_segment distribution_channel is_repeated_guest
## 1: BB PRT Direct Direct 0
## 2: BB PRT Direct Direct 0
## 3: BB GBR Direct Direct 0
## 4: BB GBR Corporate Corporate 0
## 5: BB GBR Online TA TA/TO 0
## 6: BB GBR Online TA TA/TO 0
## 7: BB PRT Direct Direct 0
## 8: FB PRT Direct Direct 0
## 9: BB PRT Online TA TA/TO 0
## 10: HB PRT Offline TA/TO TA/TO 0
## 11: BB PRT Online TA TA/TO 0
## 12: HB PRT Online TA TA/TO 0
## 13: BB USA Online TA TA/TO 0
## 14: HB ESP Online TA TA/TO 0
## 15: BB PRT Online TA TA/TO 0
## 16: BB IRL Online TA TA/TO 0
## 17: BB PRT Offline TA/TO TA/TO 0
## 18: BB IRL Online TA TA/TO 0
## 19: BB FRA Corporate Corporate 0
## 20: BB GBR Direct Direct 0
## previous_cancellations previous_bookings_not_canceled
## 1: 0 0
## 2: 0 0
## 3: 0 0
## 4: 0 0
## 5: 0 0
## 6: 0 0
## 7: 0 0
## 8: 0 0
## 9: 0 0
## 10: 0 0
## 11: 0 0
## 12: 0 0
## 13: 0 0
## 14: 0 0
## 15: 0 0
## 16: 0 0
## 17: 0 0
## 18: 0 0
## 19: 0 0
## 20: 0 0
## reserved_room_type assigned_room_type booking_changes deposit_type
## 1: C C 3 No Deposit
## 2: C C 4 No Deposit
## 3: A C 0 No Deposit
## 4: A A 0 No Deposit
## 5: A A 0 No Deposit
## 6: A A 0 No Deposit
## 7: C C 0 No Deposit
## 8: C C 0 No Deposit
## 9: A A 0 No Deposit
## 10: D D 0 No Deposit
## 11: E E 0 No Deposit
## 12: D D 0 No Deposit
## 13: D E 0 No Deposit
## 14: G G 1 No Deposit
## 15: E E 0 No Deposit
## 16: D E 0 No Deposit
## 17: E E 0 No Deposit
## 18: A E 0 No Deposit
## 19: A G 0 No Deposit
## 20: G G 0 No Deposit
## agent company days_in_waiting_list customer_type adr
## 1: NULL NULL 0 Transient 0.00
## 2: NULL NULL 0 Transient 0.00
## 3: NULL NULL 0 Transient 75.00
## 4: 304 NULL 0 Transient 75.00
## 5: 240 NULL 0 Transient 98.00
## 6: 240 NULL 0 Transient 98.00
## 7: NULL NULL 0 Transient 107.00
## 8: 303 NULL 0 Transient 103.00
## 9: 240 NULL 0 Transient 82.00
## 10: 15 NULL 0 Transient 105.50
## 11: 240 NULL 0 Transient 123.00
## 12: 240 NULL 0 Transient 145.00
## 13: 240 NULL 0 Transient 97.00
## 14: 241 NULL 0 Transient 154.77
## 15: 241 NULL 0 Transient 94.71
## 16: 240 NULL 0 Transient 97.00
## 17: 8 NULL 0 Contract 97.50
## 18: 240 NULL 0 Transient 88.20
## 19: NULL 110 0 Transient 107.42
## 20: 250 NULL 0 Transient 153.00
## required_car_parking_spaces total_of_special_requests
## 1: 0 0
## 2: 0 0
## 3: 0 0
## 4: 0 0
## 5: 0 1
## 6: 0 1
## 7: 0 0
## 8: 0 1
## 9: 0 1
## 10: 0 0
## 11: 0 0
## 12: 0 0
## 13: 0 3
## 14: 0 1
## 15: 0 0
## 16: 0 3
## 17: 0 0
## 18: 0 0
## 19: 0 0
## 20: 0 1
## reservation_status reservation_status_date
## 1: Check-Out 2015-07-01
## 2: Check-Out 2015-07-01
## 3: Check-Out 2015-07-02
## 4: Check-Out 2015-07-02
## 5: Check-Out 2015-07-03
## 6: Check-Out 2015-07-03
## 7: Check-Out 2015-07-03
## 8: Check-Out 2015-07-03
## 9: Canceled 2015-05-06
## 10: Canceled 2015-04-22
## 11: Canceled 2015-06-23
## 12: Check-Out 2015-07-05
## 13: Check-Out 2015-07-05
## 14: Check-Out 2015-07-05
## 15: Check-Out 2015-07-05
## 16: Check-Out 2015-07-05
## 17: Check-Out 2015-07-05
## 18: Check-Out 2015-07-02
## 19: Check-Out 2015-07-02
## 20: Check-Out 2015-07-05
From the following table and piechart we can see that 3.19% of the guests are repeated guests, the rest are unrepeated guests. Since people tend not to visit a certain place over and over again, so these numbers make sense.
hotels %>%
group_by(is_repeated_guest) %>%
count(is_repeated_guest) %>%
summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 x 3
## is_repeated_guest percent total
## <fct> <dbl> <int>
## 1 0 96.8 115580
## 2 1 3.19 3810
slices <- c(sum(hotels$is_repeated_guest=="1"),sum(hotels$is_repeated_guest=="0"))
lbls <- c("Repeated guest","Unrepeated guest")
pie3D(slices,labels=lbls,explode=0.1)
In order to explore several bahavior of repeated guests, I created the following two functions to disploay the visualization.
guestbehabar <- function(behavior){
hotels %>%
ggplot(aes(is_repeated_guest, fill = behavior)) +
geom_bar(position = "fill") +
labs(title = "Behavior feature by guest type",
subtitle = behavior,
x = "Guest type (1 for repeated guests)",
y = "Percentage of group") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
}
guestbehaboxplot <- function(behavior){
hotels %>%
ggplot(aes(x = is_repeated_guest, y = behavior)) +
geom_boxplot() +
geom_jitter(width = .15, alpha = .2) +
labs(title = "Behavior feature by guest type",
subtitle = behavior,
x = "Guest type (1 for repeated guests)",
y = "count") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
}
The lead time for the repeated guests are shorter than that of unrepeated guests significantly. The average lead time for the repeated guests is 31 days, while that for unrepeated guests is 106 days.
guestbehaboxplot(hotels$lead_time)
hotels %>%
group_by(is_repeated_guest) %>%
summarise(mean = mean(lead_time))
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 106.
## 2 1 30.8
43672 unrepeated guests cancel booking, which counts for 37.8% of the total unrepeated guests. On the other hand, 552 repeated guests cancel booking, which only accounts for 14.4% of the repeated group.
guestbehabar (hotels$is_canceled)
hotels %>%
group_by(is_repeated_guest) %>%
filter(is_canceled == "1") %>%
count() -> filcan
hotels %>%
group_by(is_repeated_guest) %>%
count() -> total
as.data.frame(filcan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.3778508
## 2 NA 0.1448819
Yes. 75.6% repeated guests don’t have cancelling records. The percentage for unrepeated guests is 95.2%. This makes sense because unrepeated guests have a unrepeating fashion. The mean cancellation for repeated guests and unrepeated guests are 0.47 and 0.07, respectively.
guestbehaboxplot(hotels$previous_cancellations)
hotels %>%
group_by(is_repeated_guest) %>%
filter(previous_cancellations == 0) %>%
count() -> filpre
as.data.frame(filpre/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.9519207
## 2 NA 0.7566929
hotels %>%
group_by(is_repeated_guest) %>%
summarise(mean = mean(previous_cancellations))
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 0.0745
## 2 1 0.470
74.5% bookings made by repeated guests haven’t been cancelled. The percentage for unrepeated guests is only 0.7%. The average previous bookings that haven’t been canceled for repeated guests and unrepeated guests are 3.59 and 0.02, respectively.
guestbehaboxplot(hotels$previous_bookings_not_canceled)
hotels %>%
group_by(is_repeated_guest) %>%
filter(previous_bookings_not_canceled > 0) %>%
count() -> filno
as.data.frame(filno/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.006765876
## 2 NA 0.744881890
hotels %>%
group_by(is_repeated_guest) %>%
summarise(mean = mean(previous_bookings_not_canceled))
## # A tibble: 2 x 2
## is_repeated_guest mean
## <fct> <dbl>
## 1 0 0.0234
## 2 1 3.59
It turns out that repeated guests prefer City Hotel other than Resort Hotel. However, the percentage of repeated guests that book Resort Hotel is 46.7%, while only 33.1% of unrepeated guests book Resort Hotel.
guestbehabar(hotels$hotel)
hotels %>%
group_by(is_repeated_guest) %>%
filter(hotel == "Resort Hotel") %>%
count() -> filhot
as.data.frame(filhot/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.3312165
## 2 NA 0.4666667
81.1% of repeated guests don’t make change of booking. This number is slightly lower that that of unrepeated guests.
guestbehaboxplot(hotels$booking_changes)
hotels %>%
group_by(is_repeated_guest) %>%
filter(booking_changes == 0) %>%
count() -> filchan
as.data.frame(filchan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.8498183
## 2 NA 0.8115486
By comparing the meal choice of unrepeated guests, repeated guests have high percentage that choose Bed & Breakfast, lower percentage for all the other three type of meal plan.
guestbehabar(hotels$meal)
table(hotels$meal, hotels$is_repeated_guest)
##
## 0 1
## BB 88837 3473
## FB 789 9
## HB 14277 186
## SC 11677 142
hotels %>%
group_by(is_repeated_guest) %>%
filter(meal == "BB") %>%
count() -> filmea
as.data.frame(filmea/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7686191
## 2 NA 0.9115486
Normally no. 98.2% of repeated guests don’t make deposit. A possiable reason is that they are reliable guests, so that they don’t need to make deposit. The percentage for unrepeated guests is 87.3%.
guestbehabar(hotels$deposit_type)
hotels %>%
group_by(is_repeated_guest) %>%
filter(deposit_type == "No Deposit") %>%
count() -> fildep
as.data.frame(fildep/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.8729798
## 2 NA 0.9821522
More than half of them don’t. 58.2% of repeated guests don’t make special requests. This percentage is very close to that of unrepeated guests.
guestbehabar(factor(hotels$total_of_special_requests))
hotels %>%
group_by(is_repeated_guest) %>%
filter(total_of_special_requests == 0) %>%
count() -> filspe
as.data.frame(filspe/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.5892109
## 2 NA 0.5818898
80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking. 4.2% of the repeated guests make bookings that are associated to a group. Both of these two numbers are higher than that of unrepeated guests making the bookings.
guestbehabar(hotels$customer_type)
custype <- function(type){
hotels %>%
group_by(is_repeated_guest) %>%
filter(customer_type == type) %>%
count()
}
custype("Transient") -> filtra
custype("Group") -> filgro
as.data.frame(filtra/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7487455
## 2 NA 0.8065617
as.data.frame(filgro/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.003599239
## 2 NA 0.042257218
Only 3.19% of the guests are repeated guests.
Repeated guests tend to book the hotel one month ahead of visiting, which is much shorter than that made by unrepeated guests. This indicates repeated guests don’t rush to book hotels may because they always know which hotel to book if visiting that place.
The probability of a repeated guest cancel a booking is much lower than that of a unrepeated guest does. This informs that repeated guests are much loyal.
75.6% repeated guests don’t have cancelling records previously. 74.5% bookings made by repeated guests haven’t been cancelled. The average previous bookings that haven’t been canceled is as high as 3.59.
Repeated guests prefer City Hotel other than Resort Hotel slightly. However, the percentage of choosing Resort Hotel is higher than unrepeated guests do.
81.1% of repeated guests don’t make change of booking.
91.1% of repeated guests choose Bed & Breakfast when booking hotels. Higher than unrepeated guests do.
98.2% of repeated guests don’t make deposit. Higher than unrepeated guests do.
More than half of repeated guests don’t make special requests, almost like unrepeated guests do.
80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking.