Hotel-Bookings-copy.utf8

Hotel Bookings

Mijia Li

April 1, 2020

Introduction

As the development of the world, people seem to need more and more travels nowadays. Investing a hotel seems to be a lucrative business. However, people sometimes will cancel their order and the cancellation of customer orders will influence the profit. How should you predict the cancellation of customer orders? How should you manage your hotel in different season? I believe that you will obtain a lot of knowledge about how to manage your hotel better through this analysis.

This hotel booking demand dataset contains actual data about hotel booking information from two hotels. I want to analyze data based on the two hotels to identify trends in people to cancel an order and to determine whether there are some ways to predict the cancellation of customer orders.

I will conduct univariable analyses for most variables and explore connections in variables by univariable and bivariable graphs. In addition, I will use machine learning techniques to build a model to predict which variables are related to the cancellation of customer orders.

This analysis would benefit the hotel manager and the person who would like to open a hotel. They will know how to predict the cancellation of customer orders. They can also realize customer segmentation, distribution channel, seasonality and the approximate demand of room type and meals.

Packages Required

These packages are required to load and manipulate the data.

library(tidyverse) # a combination of R packages to model, transform and visualize data
library(readr) # provide a fast and friendly way to read rectangular data (like 'csv', 'tsv', and 'fwf')
library(dplyr) # manipulate data
library(forcats) # handle categorical variables
library(DT) # create tables
library(knitr) # display tables
library(ggplot2) # visualizations

Data Preparation

My dataset is a combination of two datasets with hotel demand data. One of the hotels is a resort hotel and the other is a city hotel. There are 31 variables and 119390 observations in the original dataset. The original purpose of the data is for revenue management. The dataset consists booking due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. The dataset I downloaded from March 25, 2020 can be found here.

I imported the csv file after downloading the data.

hotel <- read.csv('Spring 2020/BANA 7025/Week 3/hotels.csv', 
                       stringsAsFactors = FALSE)

I removed some columns because other columns gave better information or the columns do not provide much useful analytic information.

arrival_date_week_number column had too many levels and arrival_date_month can provide enough information for analyzing seasonality, so I removed the arrival_date_week_number column.
adults, children, and babies columns had too many levels and assigned_room_type can provide good information for the analysis, so I removed adults, children, and babies columns.
agent and company had a lot of NULL values and had too many levels, so I removed them.
reserved_room_type, days_in_waiting_list, booking_changes, stays_in_weekend_nights, stays_in_week_nights, required_car_parking_spaces, country, adr and total_of_special_requests columns do not provide much useful analytic information, so I removed them.
market_segment can provide better information than customer_type, so I removed customer_type column.
is_canceled can provide better information than reservation_status, so I removed reservation_status column.
reservation_status_date do not have useful information since there are stays_in_weekend_nights and stays_in_week_nights columns, so I removed reservation_status_date column.
previous_cancellations column can provide better information than is_repeated_guest and previous_bookings_not_canceled columns, so I removed is_repeated_guest and previous_bookings_not_canceled columns.

drop.cols = c("arrival_date_week_number", "is_repeated_guest", "previous_bookings_not_canceled", "adults", "children", "babies", "stays_in_weekend_nights", "stays_in_week_nights", "reserved_room_type", "booking_changes", "agent", "company", "days_in_waiting_list", "customer_type","total_of_special_requests", "reservation_status", "reservation_status_date", "country", "required_car_parking_spaces", "adr")
hotel2 <- hotel %>% select(-one_of(drop.cols))

I used glimpse function from the dplyr package to reveal the variables and data type for each column.

glimpse(hotel2)

## Observations: 119,390
## Variables: 12
## $ hotel                     <chr> "Resort Hotel", "Resort Hotel", "Resort Hot…
## $ is_canceled               <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0…
## $ lead_time                 <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, 23, …
## $ arrival_date_year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ arrival_date_month        <chr> "July", "July", "July", "July", "July", "Ju…
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ meal                      <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB", "…
## $ market_segment            <chr> "Direct", "Direct", "Direct", "Corporate", …
## $ distribution_channel      <chr> "Direct", "Direct", "Direct", "Corporate", …
## $ previous_cancellations    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ assigned_room_type        <chr> "C", "C", "C", "A", "A", "A", "C", "C", "A"…
## $ deposit_type              <chr> "No Deposit", "No Deposit", "No Deposit", "…

For variable market_segment, there are 8 categories. For Undefined category, only two observations here, so I removed those two observations since my original dataset has 119,390 observations.

hotel2 <- hotel2[!hotel2$market_segment == "Undefined", ]

I condensed the 7 categories in the market_segment variable into 5 after removing the Undefined category: “Direct”, “Corporate”, “TA/TO”, “Groups”, and “Other”.

hotel2 <- hotel2 %>%
  mutate(market_segment = ifelse(str_detect(market_segment, "Direct")==TRUE, "Direct",
                                 ifelse(str_detect(market_segment,"Corporate")==TRUE, "Corporate",
                                        ifelse(str_detect(market_segment,"Online TA")==TRUE, "TA/TO",
                                               ifelse(str_detect(market_segment, "Offline TA/TO")==TRUE, "TA/TO",
                                                      ifelse(str_detect(market_segment, "Groups")==TRUE, "Groups",
                                                             "Other"))))))

For variable market_segment, there are 5 categories. For Undefined category, only five observations here, so I removed those 5 observations since my original dataset has 119,390 observations.

hotel2 <- hotel2[!hotel2$distribution_channel == "Undefined", ]

I condensed the 4 categories in the distribution_channel variable into 3 after removing the Undefined category: “Direct”, “TA/CO”, and “GDS”. By this way, I combined the “Direct” and “Group” to “Direct”. Because “Group” channel is also booking the hotel directly.

hotel2 <- hotel2 %>%
  mutate(distribution_channel = ifelse(str_detect(distribution_channel, "TA/TO")==TRUE, "TA/TO",
                                       ifelse(str_detect(distribution_channel,"GDS")==TRUE, "GDS",
                                              "Direct")))

For variable meal, Both Undefined and SC means that no meal package booked. I condensed the 5 categories in the meal variable into 4 by combing Undefined category to SC: “SC”, “HB”, “FB”.

hotel2 <- hotel2 %>%
  mutate(meal = ifelse(str_detect(meal, "HB")==TRUE, "HB",
                       ifelse(str_detect(meal,"FB")==TRUE, "FB",
                              ifelse(str_detect(meal,"BB")==TRUE, "BB",
                                     "SC"))))

One of my analysis is about whether the meal booking is related to the cancellation of order. I created a variable meal_booking that labels if a person booked any meal.

hotel2 <- hotel2 %>%
  mutate(meal_booking = ifelse(str_detect(meal, "SC") == TRUE, "No meal", "Meal"))

For variable previous_cancellations, 95% of observations have a value of 0, so I condensed it to 2 categories No and Yes. No means no previous cancellations and Yes means previous cancellations exist. Because my analysis is not focus on how many previous cancellations are, and I would like to analysis whether previous cancellation is related to the cancellation of current order.

hotel2 <- hotel2 %>%
  mutate(previous_cancellations = ifelse(str_detect(previous_cancellations, "0")==TRUE, "No", "Yes"))

For variable is_canceled column, I changed the 0 to No, and changed the 1 to Yes since we need clearly understanding for future visualization.

hotel2 <- hotel2 %>%
  mutate(is_canceled = ifelse(str_detect(is_canceled, "0")==TRUE, "No", "Yes"))

I converted some categorical variables, integer variables and numeric variables to factors using the as_factor function.

hotel2 <- hotel2 %>%
  mutate(hotel = as_factor(hotel),
         is_canceled = as_factor(is_canceled),
         arrival_date_month = as_factor(arrival_date_month),
         arrival_date_day_of_month = as_factor(arrival_date_day_of_month),
         meal = as_factor(meal),
         market_segment = as_factor(market_segment),
         distribution_channel = as_factor(distribution_channel),
         previous_cancellations = as_factor(previous_cancellations),
         assigned_room_type = as_factor(assigned_room_type),
         deposit_type = as_factor(deposit_type),
         meal_booking = as_factor(meal_booking),
  )

Here is a glimpse of all variables now after data cleaning and data type modification.

glimpse(hotel2)

## Observations: 119,385
## Variables: 13
## $ hotel                     <fct> Resort Hotel, Resort Hotel, Resort Hotel, R…
## $ is_canceled               <fct> No, No, No, No, No, No, No, No, Yes, Yes, Y…
## $ lead_time                 <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, 23, …
## $ arrival_date_year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
## $ arrival_date_month        <fct> July, July, July, July, July, July, July, J…
## $ arrival_date_day_of_month <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ meal                      <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB, BB,…
## $ market_segment            <fct> Direct, Direct, Direct, Corporate, TA/TO, T…
## $ distribution_channel      <fct> Direct, Direct, Direct, Direct, TA/TO, TA/T…
## $ previous_cancellations    <fct> No, No, No, No, No, No, No, No, No, No, No,…
## $ assigned_room_type        <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, G, E…
## $ deposit_type              <fct> No Deposit, No Deposit, No Deposit, No Depo…
## $ meal_booking              <fct> Meal, Meal, Meal, Meal, Meal, Meal, Meal, M…

Below is a preview of the cleaned data.

hotel2 %>%
  head(150) %>%
  datatable()

Blow is a table of the variable names, data types, and variable descriptions.

hotel2.type <- lapply(hotel2, class)
hotel2.var_desc <- c("Name of hotel",
                     "Indicates if the order is canceled or not",
                     "Number of days that elapsed between booking date and arrival date",
                     "Year of arrival date",
                     "Month of arrival date",
                     "Day of the month of the arrival date",
                     "Type of meal booked",
                     "Market segment designation",
                     "Booking distribution channel",
                     "Indicates if the guest has the canceled order before",
                     "Code for the type of room assigned to the booking",
                     "Deposit type of booking",
                     "Indicates if the customer booked any meals")
hotel2.var_name <- colnames(hotel2)
data.description <- as_tibble(cbind(hotel2.var_name, hotel2.type, hotel2.var_desc))
colnames(data.description) <- c("Variable Name", "Data Type", "Variable Description")
kable(data.description)

Variable Name	Data Type	Variable Description
hotel	factor	Name of hotel
is_canceled	factor	Indicates if the order is canceled or not
lead_time	integer	Number of days that elapsed between booking date and arrival date
arrival_date_year	integer	Year of arrival date
arrival_date_month	factor	Month of arrival date
arrival_date_day_of_month	factor	Day of the month of the arrival date
meal	factor	Type of meal booked
market_segment	factor	Market segment designation
distribution_channel	factor	Booking distribution channel
previous_cancellations	factor	Indicates if the guest has the canceled order before
assigned_room_type	factor	Code for the type of room assigned to the booking
deposit_type	factor	Deposit type of booking
meal_booking	factor	Indicates if the customer booked any meals

Proposed Exploratory Data Analysis

Univariate Analyses

Seasonality

Ignore the hotel type, July and August are the two months that contain higher customer bookings. For resot hotel, it is obvious that the hotel booking is higher on July and August. For city hotel, the hotel booking is higher from April to October.

Month <- hotel2[hotel2$arrival_date_month != " ", ] %>%
  ggplot(aes(x=factor(arrival_date_month, levels = c("January", "February", "March","April","May",
  "June","July","August","September","October","November","December"))))+
  geom_bar() +
  scale_x_discrete(name = "Months") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,15000,1500)) +
  ggtitle("Customer Arrival Months of Two Types of Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5), axis.text.x = element_text(angle=90, vjust=.5))
  Month+ geom_bar(aes(fill = hotel))

Above bar chart showed that August is the best month for hotel since they can receive the biggest amount of booking on August. If you are planning to open a hotel, you may interest in how many bookings in each day on August. I filtered the data for the only year in 2017. You can see the highest amount in the city hotel is 139, and the highest amount in the resort hotel is 89.

hotel2 %>%
  filter(arrival_date_month == "August") %>%
  filter(hotel == "City Hotel") %>%
  filter(arrival_date_year == "2017") %>%
  ggplot(aes(x = arrival_date_day_of_month,)) +
  geom_bar(fill = 'light blue') +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Days") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,15000,1500)) +
  ggtitle("Hotel Bookings on August 2017 for City Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

hotel2 %>%
  filter(arrival_date_month == "August") %>%
  filter(hotel == "Resort Hotel") %>%
  filter(arrival_date_year == "2017") %>%
  ggplot(aes(x = arrival_date_day_of_month,)) +
  geom_bar(fill = 'light green') +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Days") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,15000,1500)) +
  ggtitle("Hotel Bookings on August 2017 for Resort Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Market Segment

For the market segment, 68% of customers are the TA/TO (“TA” means “Travel Agents” and “TO” means “Tour Operators”), 17% of the customers are the groups, 11% of the customers are the direct customers and 4% of the customers are corporates.

Market <- hotel2[hotel2$market_segment != " ", ] %>%
  ggplot(aes(x=market_segment)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Market Segment") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,80000,8000)) +
  ggtitle("Market Segment of Two Types of Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
  Market+ geom_bar(aes(fill = hotel))

Distribution Channel

There are three types of distribution channels, 82% of the distribution channel is TA/TO, 18% of the distribution channels is direct booking and the GDS distribution channel is less than 1%.

Distribution <- hotel2[hotel2$distribution_channel != " ", ] %>%
  ggplot(aes(x=distribution_channel)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Distribution Channel") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,99000,9000)) +
  ggtitle("Distribution Channel of Two Types of Hotel ") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
  Distribution+ geom_bar(aes(fill = hotel))

Room Type

Ignore the hotel type, 62% of the room type is A, 21% of the room type is D and 6% of the room type is E. For the city hotel, the room type K is needed. For the resort hotel, the room type I and H are needed, and resort hotel need more room type C, E and G than city hotel.

Room <- hotel2[hotel2$assigned_room_type != " ", ] %>%
  ggplot(aes(x=assigned_room_type)) +
  geom_bar() +
  scale_x_discrete(name = "Assigned Room Type") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,75000,7500)) +
  ggtitle("Assigned Room Type of Two Types of Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
  Room+ geom_bar(aes(fill = hotel))

Meals

There are four types of meal: BB, FB, HB and SC. BB means Bed & Breakfast, HB means Half board (breakfast and one other meal – usually dinner), FB means Full board (breakfast, lunch and dinner) and SC means no meal package. 77% of the customers chose BB, 12% of the customers chose HB, 10% of the customer did not order any meal and a very small number of customers chose FB.

Meal <- hotel2[hotel2$meal != " ", ] %>%
  ggplot(aes(x=meal)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Meals") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,99000,10000)) +
  ggtitle("Meals of Two Types of Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
  Meal+ geom_bar(aes(fill = hotel))

Cancellation of Hotel Booking

How should you predict the cancellation of customer orders? Let’s see the ratio of cancellation first. For the city hotel, 42% of the customers canceled their order. For the resort hotel, 28% of the customers canceled their order. We can see that there isn’t a small number of customers that cancelled their orders. So, let’s analyze the cancellation by different variables.

    hotel2%>%
      ggplot(aes(x=is_canceled, fill = factor(hotel))) +
      geom_bar(position = "dodge") +
      geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
      scale_x_discrete(name = "Cancellations Exist") +
      scale_y_continuous(name = "Frequency",breaks = seq(0,80000,8000)) +
      ggtitle("Count Cancellation in the Two Hotels") +
      theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between meal booking and cancellation? For customers who ordered meal, the cancellation is 37%. For customers who did not ordered meal, the cancellation is 36%. So, there is no relationship between meal booking and cancellation.

hotel2 %>%
  ggplot(aes(x=meal_booking, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Meal Booking") +
  scale_y_continuous(name = "Frequency",breaks = seq(0,80000,8000)) +
  ggtitle("Cancellation by Meal Booking") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between hotel type and cancellation? For resort hotel, the cancellation is 28%. For city hotel, the cancellation is 42%. So, there are more customers would like to cancel the booking from the city hotel than resort hotel.

hotel2 %>%
  ggplot(aes(x=hotel, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Hotel") +
  scale_y_continuous(name = "Frequency",breaks = seq(0,80000,8000)) +
  ggtitle("Cancellation by Hotel") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between market segment and cancellation? It’s obvious that for the groups, the cancellation is more than 50%, and 61% of the customers cancelled their orders. For direct ordering customers, the cancellation is 15%. For corporate ordering customers, the cancellation is 16%. For TA/TO ordering customers, the cancellation is 36%. So, Groups and TA/TO customers will more likely to cancel the booking than other market segments.

hotel2%>%
  ggplot(aes(x=market_segment, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Market Segment") +
  scale_y_continuous(name = "Frequency",breaks = seq(0,80000,8000)) +
  ggtitle("Cancellation by Market Segment") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between previous cancellation and cancellation of current booking? For the customers that didn’t cancel the booking before, the cancellation is 34%. For the customers that ever cancelled the booking, the cancellation is 92%. It’s obvious that the customers that ever cancelled the booking will more likely to cancel the current order.

hotel2%>%
  ggplot(aes(x=previous_cancellations, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Previous Cancellations Exist") +
  scale_y_continuous(name = "Frequency",breaks = seq(0,80000,8000)) +
  ggtitle("Cancellation by Previous Cancellations") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between deposit type and cancellation? For the no deposit type, the cancellation is 28%. For the non refund type, the cancellation is 99%. It’s obvious that the customers chose non refund deposit type will more likely to cancel the current order.

hotel2%>%
  ggplot(aes(x=deposit_type, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  geom_text(stat = "count", aes(label=..count..),vjust=-0.1) +
  scale_x_discrete(name = "Deposit Type") +
  scale_y_continuous(name = "Frequency", breaks = seq(0,80000,8000)) +
  ggtitle("Cancellation by Deposit Type") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Is there any relationship between lead time and cancellation? When the lead time is more than 550 days, there is an about 100% of cancellation of the booked hotel.

hotel2 %>%                      
  ggplot(aes(is_canceled,lead_time)) + 
  geom_jitter(color = 'pink') +
  scale_x_discrete(name = "Cancellation Exist") +
  scale_y_continuous(name = "Lead Time", breaks = seq(0,800,100)) +
  coord_flip() + 
  labs(title = "Cancellation by Lead Time") + 
  theme(legend.position="none", 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_blank())

Summary

My analysis is focus on the problem how should you predict the cancellation of customer orders? I used the dataset that contains 31 variables and 119390 observations about two hotels’ (city hotel and resort hotel) booking data from 2015 to 2017.

My exploratory data analysis uncovered a few trends about customers who are cancelled the current booking:

More customers would like to cancel the booking from the city hotel than resort hotel.
Groups and TA/TO (“TA” means “Travel Agents” and “TO” means “Tour Operators”) customers will more likely to cancel the booking than other market segments.
The customers who ever cancelled the booking will more likely to cancel the current order, and the cancellation ratio is 92%.
The customers who chose non refund deposit type will more likely to cancel the current order, and the cancellation ratio is 99%.
When the lead time is more than 550 days, there is an about 100% of cancellation of the booked hotel.

This gives managers of hotel or the people who would like to open a new hotel good information. In my dataset, the ratio of cancellation is 42% for city hotel and 28% for resort hotel. If managers can predict the cancellation, the profit will also be increased. You can predict the cancellation based on the above cancellation ratio of lead time over 550, deposit type, previous cancellation and market segments. In addition, there are some basic information from univariable analyses. You will know that the most popular months are July and August for hotel booking, the highest booking amount on August 2017 is 139 in city hotel and 89 in resort hotel, the 68% of the customers are the TA/TO for market segment and 82% of the distribution channel is TA/TO and 77% of the customers chose Bed & Breakfast.

I believe that my analysis did not have much predictive power since my dataset just contains two hotels. However, my analysis provided good information if the two types of hotel is typically. In addition, I would like to continue improving my analysis if I got more related information in the future.