Data Wrangling Mid Term Project

Akash Khanvilkar, Guna Shanmukha Gedela, Sindhu Herle

Hotel Bookings Analysis

Introduction

Background

Hospitality generates revenue for local economies directly when tourists spend money in hotels, restaurants and entertainment venues. Hospitality industry is growing, with more and more people spending their money for vacation and leisure activities. People may only lodge into a hotel when it’s a holiday season or a special event, thus the demand for staying in room is not equally distributed across the year. Hotel industry is a very volatile industry and the bookings depend on variety of factors such as type of hotels, seasonality, days of week and many more. This makes analyzing the patterns available in the past data more important to help the hotels plan better. Using the historical data, hotels can perform various campaigns to boost the business. The data consists of around 119,390 booking transactions from 2 hotel: an anonymous city hotel from Lisbon and a resort hotel from Algarve. The dataset comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. There is so much to explore from this data, but we will only focus on demand forecasting.

Proposed Analytical Methodology

We will tackle this problem statement in three segments:

  1. We will be analyzing some key metrics for hotel bookings like: Most preferred meal types, country wise bookings, new customers acquired, the number of cancellations, number of bookings on weekday vs weekends, customer lifetime value of the existing customers, type of rooms preferred by customers, booking types, hotels available for booking and the revenue of the hotels
  2. Further, we will be using various angles to look through the data to analyze patterns associated with each segment such as: The type of hotel, day of week, type of customers,type of rooms
  3. Furthermore, we will also try to predict the future bookings either based on time series analysis or decision trees.

Benefit of Analysis

The project aims to gain interesting insight into customers’ behavior when booking a hotel. To maximize the revenue gained by the hotel, the management often employed a pricing strategy, one of them being raising the room rate when the demand is high and making a promo when the demand is low. Thus, the ability to accurately forecast the future demand is very important and became a vital part on the pricing scheme. The demand for different segment of customer may differ and forecasting become harder as it may requires different model for different segment.These insights can guide hotels to adjust their customer strategies and make preparation for unknown.

Packages Required

  1. Rpart, rpart.plot and ROCR : These packages are used for building classification and regression models using decision trees. Further, we can visualize the tree structure and evaluate the performance of the models.

  2. Tidyverse : This package consists of 6 core packages out of which the below 3 are most important for this project: dplyr: Used for data manipulation tidyr: Used for data modifications ggplot2: Used for creating powerful visualizations.

  3. Forecast, tseries and sarima : These packages are used to model the time-series data including the seasonal component in the series (if any).

#Importing the packages required

library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(tidyverse) #used for data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tseries) #used for time-series forecasting
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(sarima) #used for time-series forecasting including seasonal components
## Loading required package: stats4

Data Dictionary

The data set contains the following variables:

  • hotel : Type of Hotel whether Resort Hotel(H1) or City Hotel(H2)
  • is_canceled : Value indicating if the booking was canceled (1) or not (0)
  • lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
  • arrival_date_year : Year of arrival date
  • arrival_date_month : Month of arrival date
  • arrival_date_week_number : week number of year for arrival date
  • arrival_date_day_of_month : Day of arrival date
  • stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • adults : Number of adults
  • children : Number of children
  • babies : Number of babies
  • meal : Type of meal booked. Undefined/SC – no meal package, BB – Bed & Breakfast, HB – Half board (breakfast and one other meal – usually dinner), FB – Full board (breakfast, lunch and dinner)
  • country : Country of origin.
  • market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
  • distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
  • is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
  • previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
  • previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking
  • reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons
  • assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request.
  • booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
  • deposit_type : Type of deposit made for booking: No Deposit – no deposit was made,Non Refund – a deposit was made in the value of the total stay cost, Refundable – a deposit was made with a value under the total cost of stay.
  • agent : ID of the travel agency that made the booking
  • company : ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
  • days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer
  • customer_type : Type of booking :Contract - when the booking has an allotment or other type of contract associated to it, Group – when the booking is associated to a group, Transient – when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party – when the booking is transient, but is associated to at least other transient booking
  • adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
  • required_car_parking_spaces : Number of car parking spaces required by the customer
  • total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)
  • reservation_status : Reservation last status:Canceled – booking was canceled by the customer,Check-Out – customer has checked in but already departed, No-Show – customer did not check-in and did inform the hotel of the reason why
  • reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when was the booking canceled or when did the customer checked-out of the hotel

Data Preparation

The source data can be found here: Hotel Data

Data Structure

This data set contains 119390 observations and 32 variables. Some of the variables are integer, others are character. It includes information about hotel names, lead time, number of children, agent, and much more.

#Importing the dataset
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hotels)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
#Structure of the dataset
str(hotels)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Missing values

By checking missing values for each variables, we can see that there are only 4 missing value for variable “children”, no missing value in other variables.

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Taking care of the missing value

hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)

Exploratory Data Analysis

Total bookings across years

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count(is_repeated_guest) %>% 
  summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 × 3
##   is_repeated_guest percent  total
##               <dbl>   <dbl>  <int>
## 1                 0   96.8  115580
## 2                 1    3.19   3810

Leading time difference between two groups of guests

guestbehabar <- function(behavior){
   hotels %>%
    ggplot(aes(is_repeated_guest, fill = behavior)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior feature by guest type", 
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "Percentage of group") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}
behaviourboxplot <- function(behavior){
   hotels %>%
    ggplot(aes(x = is_repeated_guest, y = behavior)) +
    geom_boxplot() +
    geom_jitter(width = .15, alpha = .2) +
    labs(title = "Behavior feature by guest type",
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "count") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}
behaviourboxplot(hotels$lead_time)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Type of hotel booked by repeated guests

guestbehabar(hotels$hotel)

It turns out that repeated guests prefer City Hotel other than Resort Hotel. However, the percentage of repeated guests that book Resort Hotel is 46.7%, while only 33.1% of unrepeated guests book Resort Hotel.

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(hotel == "Resort Hotel") %>% 
  count() -> filhot       
hotels %>% 
  group_by(is_repeated_guest) %>% 
  count() -> total

as.data.frame(filhot/total)
##   is_repeated_guest         n
## 1               NaN 0.3312165
## 2                 1 0.4666667

Repeated guests cancelling booking

hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)
guestbehabar(hotels$is_canceled)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(is_canceled == "1") %>% 
  count() -> filcan

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count() -> total

as.data.frame(filcan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.3778508
## 2                NA 0.1448819

Meal choice of repeated guests

guestbehabar(hotels$meal)

table(hotels$meal, hotels$is_repeated_guest)
##            
##                 0     1
##   BB        88837  3473
##   FB          789     9
##   HB        14277   186
##   SC        10540   110
##   Undefined  1137    32
hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(meal == "BB") %>% 
  count() -> filmea     

as.data.frame(filmea/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.7686191
## 2                NA 0.9115486

Percentage of repeated guests making deposit

guestbehabar(hotels$deposit_type)

Normally 98.2% of repeated guests don’t make deposit. A possible reason is that they are reliable guests, so that they don’t need to make deposit. The percentage for unrepeated guests is 87.3%.

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(deposit_type == "No Deposit") %>% 
  count() -> fildep       

as.data.frame(fildep/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.8729798
## 2                NA 0.9821522

Customer type of repeated guests

guestbehabar(hotels$customer_type)

80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking. 4.2% of the repeated guests make bookings that are associated to a group. Both of these two numbers are higher than that of unrepeated guests making the bookings.

custype <- function(type){
  hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(customer_type == type) %>% 
  count()
}

custype("Transient") -> filtra
custype("Group") -> filgro

as.data.frame(filtra/total) 
## Warning in Ops.factor(left, right): '/' not meaningful for factors
##   is_repeated_guest         n
## 1                NA 0.7487455
## 2                NA 0.8065617

Summary

  • The probability of a repeated guest cancel a booking is much lower than that of a unrepeated guest does. This informs that repeated guests are much loyal.

  • 81.1% of repeated guests don’t make change of booking.

  • 98.2% of repeated guests don’t make deposit. Higher than unrepeated guests do.

  • 80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking.

  • Only 3.19% of the guests are repeated guests.

  • Repeated guests tend to book the hotel one month ahead of visiting, which is much shorter than that made by unrepeated guests. This indicates repeated guests don’t rush to book hotels may because they always know which hotel to book if visiting that place.