Hotel Cancellation Risk Analysis

Introduction

Overview

Hotels often overbook on the assumption that a certain percentage of guests will cancel their reservations. Overbooking protects hotels from losses due to operating below capacity; however, it also creates additional risk. If there are fewer cancellations than forecast, hotels will be unable to provide rooms for some guests, which can result in the loss of loyal customers and damage to the brand. Determining the optimal rate of overbooking serves the best interests of both hotel firms and consumers by enabling firms to maximize profits while adequately satisfying consumer needs. A major hotel corporation has requested our services to determine the optimal rate of overbooking in order to minimize risk and maximize profits.

Action Plan

To determine the optimal overbooking rate for our client, we will analyze comprehensive hotel booking data collected from 2015 to 2017. We will examine numerous variables to determine their relationship with the cancellation rate.
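As a minimal sketch of relating one variable to cancellations, the mean of a 0/1 cancellation indicator within each group gives that group's cancellation rate. The toy data frame below is invented for illustration, and the column names hotel and is_canceled are assumed to match the booking data.

```r
# Toy bookings (invented values, for illustration only)
toy <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 0, 1, 0, 0)
)

# The mean of a 0/1 indicator is the share of bookings canceled in each group
cancel_rate <- tapply(toy$is_canceled, toy$hotel, mean)
print(cancel_rate)
```

The same call can be repeated with any categorical column in place of hotel to compare cancellation rates across its levels.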

Techniques

We will use a variety of analytic techniques to make sense of our data, including exploratory data analysis, analysis of variance (ANOVA), and linear regression.

Utility of analysis

Gaining insight into the relationship between different variables and the cancellation rate will enable us to advise our client on the factors that must be considered when determining the ideal overbooking rate. Overall, this will enable our client to set the overbooking rate that maximizes profits and minimizes risk.
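To illustrate how a cancellation rate could feed into an overbooking decision, the sketch below finds the largest number of bookings a hotel could accept while keeping the chance of more show-ups than rooms below a chosen threshold. It assumes each booking cancels independently with the same probability, so show-ups follow a binomial distribution. The function name booking_limit, its parameters, and the example inputs are hypothetical and are not taken from the client's data.

```r
# Largest number of bookings n such that P(show-ups > capacity) <= max_risk,
# assuming show-ups ~ Binomial(n, 1 - cancel_prob) (independence is an assumption)
booking_limit <- function(capacity, cancel_prob, max_risk = 0.05) {
  stopifnot(cancel_prob >= 0, cancel_prob < 1, max_risk > 0, max_risk < 1)
  n <- capacity
  repeat {
    # Risk of running out of rooms if one more booking were accepted
    risk_next <- pbinom(capacity, size = n + 1, prob = 1 - cancel_prob,
                        lower.tail = FALSE)
    if (risk_next > max_risk) break
    n <- n + 1
  }
  n
}

# Hypothetical example: 100 rooms and a 30% cancellation probability
limit <- booking_limit(capacity = 100, cancel_prob = 0.3)
print(limit)
```

A lower max_risk gives a more conservative limit. In practice the cancellation probability would come from the analysis of the booking data, not from a fixed guess.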

Packages

Packages Used

tidyverse = collection of packages for data manipulation and visualization that are designed to work together (it includes ggplot2 and dplyr)
ggplot2 = graphical representation in R
dplyr = data manipulation in R
DT = displays R data objects (matrices or data frames) as tables on HTML pages

Data Preparation

Data Source

Our data was obtained from Antonio, Almeida, and Nunes (2019).

setwd("C:/Users/anger/OneDrive - University of Cincinnati/BANA 7025/BANA 7025");
data.df <- read.csv("C:/Users/anger/OneDrive - University of Cincinnati/BANA 7025/hotels.csv", stringsAsFactors = FALSE)

Explanation of Data

The data was originally collected for research and education in fields such as revenue management, machine learning, and data mining. It combines two data sets, one for a resort hotel and one for a city hotel. In total, the data has 119,390 observations, each representing a hotel booking; the file contains 32 columns (the 31 variables described by the source authors plus the hotel identifier). The data covers bookings from July 2015 to August 2017. The data appears to be relatively clean; however, different variables use inconsistent conventions to mark missing values: NA, “Undefined”, and “NULL” all appear. We will use data cleaning techniques to classify missing values uniformly and make a strategic decision on whether to omit, change, or keep them.
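One way to see how widespread each missing-value marker is would be to count them column by column. The sketch below does this on a small invented data frame; the marker strings are the ones named above, and the toy values are for illustration only.

```r
# Toy columns (invented values) using the three markers described above
toy <- data.frame(
  children = c(0, NA, 2),
  meal = c("BB", "Undefined", "HB"),
  agent = c("NULL", "9", "NULL"),
  stringsAsFactors = FALSE
)

# Count entries per column that are NA or one of the text markers
missing_markers <- c("Undefined", "NULL")
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col %in% missing_markers))
print(missing_counts)
```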

Data Cleaning Steps

We will clean the data to create consistency and ensure it can be thoroughly and efficiently analyzed.

To clean the data we will:

* Ensure variable names follow a proper, uniform naming convention
    + Snake case was selected as the designated naming convention
    + Source data was received in the selected naming convention
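A quick check that column names follow snake case can be sketched with a regular expression. The helper name is_snake_case and the example names (including the deliberately non-conforming "LeadTime") are hypothetical.

```r
# TRUE when a name is lowercase words (letters/digits) joined by single underscores
is_snake_case <- function(x) grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", x)

# Example names; "LeadTime" is a made-up counterexample
col_names <- c("lead_time", "arrival_date_year", "LeadTime", "adr")
snake_ok <- is_snake_case(col_names)
print(col_names[!snake_ok])  # names that would need renaming
```

Applied to the real data, the same helper could be called as is_snake_case(names(data.df)).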

Missing values were replaced, and categories with the same meaning were consolidated, for the following variables:
meal: “Undefined” and “SC” were consolidated into a “none” category because the data dictionary defines these categories as the same.

data.df$meal <- replace(data.df$meal, data.df$meal=="Undefined", "none");
data.df$meal <- replace(data.df$meal, data.df$meal=="SC", "none")

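children: Missing (NA) values were replaced with 0, on the assumption that a booking with no recorded number of children had none.

# read.csv stores missing entries as NA, which must be matched with is.na() rather than == "NA"
data.df$children <- replace(data.df$children, is.na(data.df$children), 0)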
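When read.csv reads a numeric column, an empty or NA entry becomes a true missing value, not the text "NA". The sketch below, on an invented vector, shows why a comparison against the string "NA" leaves such values untouched while is.na() finds them.

```r
# Toy vector with one true missing value (invented values)
children <- c(1, NA, 2)

# Comparing with the string "NA" yields NA for the missing entry, so nothing is replaced
unchanged <- replace(children, children == "NA", 0)

# is.na() identifies the missing entry, so it is replaced with 0
fixed <- replace(children, is.na(children), 0)

print(unchanged)
print(fixed)
```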

Two variables were removed because they cannot be used to predict cancellations: their values are recorded only after a booking's outcome (including any cancellation) is known.
These variables are:
reservation_status and reservation_status_date

data.df <- data.df[-c(31,32)]

The category “Undefined” in market_segment and distribution_channel was changed to “none”.
distribution_channel was then removed to avoid including two correlated predictors (market_segment and distribution_channel) in our analysis.

data.df <- data.df[-c(16)]
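The recoding of “Undefined” described above is not shown in the code. A minimal sketch of how it might look, together with a cross-tabulation of the two variables before one is dropped, follows. The toy data frame and its values are invented for illustration.

```r
# Toy rows (invented values) with an "Undefined" entry in each column
toy <- data.frame(
  market_segment = c("Online TA", "Undefined", "Direct", "Online TA"),
  distribution_channel = c("TA/TO", "Undefined", "Direct", "TA/TO"),
  stringsAsFactors = FALSE
)

# Recode "Undefined" to "none" in both columns
for (col in c("market_segment", "distribution_channel")) {
  toy[[col]][toy[[col]] == "Undefined"] <- "none"
}

# Cross-tabulate to see how closely the two variables track each other
overlap <- table(toy$market_segment, toy$distribution_channel)
print(overlap)

# Drop the redundant column by name rather than by position
toy$distribution_channel <- NULL
```

Dropping by name avoids depending on the column's position, which can shift if earlier columns are removed.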

“NULL” values in the variables agent and company were replaced with “none”.

data.df$agent <- replace(data.df$agent, data.df$agent=="NULL","none");
data.df$company <- replace(data.df$company, data.df$company=="NULL","none")

Categorical data was transformed into factor variables.
These variables were designated as categorical:
hotel, arrival_date_year, arrival_date_month, meal, country, market_segment, reserved_room_type, assigned_room_type, booking_changes, deposit_type, agent, company, customer_type

# Variables to treat as categorical
factor.cols <- c("hotel", "arrival_date_year", "arrival_date_month", "meal", "country",
                 "market_segment", "reserved_room_type", "assigned_room_type",
                 "booking_changes", "deposit_type", "agent", "company", "customer_type")

# Work on a copy so the cleaned data.df is preserved, then convert each column to a factor
data.df.one <- data.df
data.df.one[factor.cols] <- lapply(data.df[factor.cols], as.factor)
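One detail worth checking after the conversion is level order: as.factor() sorts levels alphabetically, which puts months out of calendar order in tables and plots. The sketch below shows one possible way to impose calendar order, assuming the month column holds full English month names. The example vector is invented.

```r
# Toy month values (invented), assumed to be full English month names
months_seen <- c("July", "August", "January", "July")

# Default conversion: levels are sorted alphabetically
alphabetical <- as.factor(months_seen)
print(levels(alphabetical))

# Explicit levels from the built-in month.name constant give calendar order
calendar <- factor(months_seen, levels = month.name)
print(levels(calendar))
```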

Proposed Exploratory Data Analysis

Plan to Uncover New Information

We will uncover information that is not self-evident in the data by modeling the relationships between the explanatory variables and cancellations.

Tables and Plots Used

We will provide histograms, boxplots, an ANOVA table for our numeric variables, a simple regression model with a line of best fit for each numeric variable, and a summary statistics table.
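The outputs listed above could be produced along the following lines. This is a sketch on simulated data, not on the hotel bookings: the column names lead_time and is_canceled are assumed, the relationship between them is invented, and base R graphics are used only to keep the example self-contained (ggplot2 equivalents would work equally well).

```r
set.seed(1)
n <- 500

# Simulated data (assumption: cancellation chance rises with lead time)
toy <- data.frame(lead_time = rexp(n, rate = 1 / 100))
toy$is_canceled <- rbinom(n, 1, plogis(-1 + 0.01 * toy$lead_time))

# Summary statistics table
print(summary(toy))

# Histogram and boxplot of a numeric variable
hist(toy$lead_time, main = "Lead time", xlab = "Days")
boxplot(lead_time ~ is_canceled, data = toy,
        xlab = "is_canceled", ylab = "lead_time")

# Simple regression with one numeric predictor, plus its ANOVA table
fit <- lm(is_canceled ~ lead_time, data = toy)
aov_table <- anova(fit)
print(aov_table)

# Scatter plot with the fitted line of best fit
plot(toy$lead_time, toy$is_canceled, xlab = "lead_time", ylab = "is_canceled")
abline(fit)
```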

Need to Learn

We will need to learn more about modeling including how to create models, how to choose the best model for our data, and useful packages for creating models. We also need to learn how to

Machine Learning Techniques

We will incorporate machine learning by fitting linear regression models.
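A common way to use regression in a machine learning setting is to fit the model on one part of the data and measure its error on the rest. The sketch below shows that idea on simulated data. The 80/20 split, the column names, and the simulated relationship are all assumptions made for illustration.

```r
set.seed(42)
n <- 1000

# Simulated data (invented relationship between lead time and cancellation)
sim <- data.frame(lead_time = rexp(n, rate = 1 / 100))
sim$is_canceled <- rbinom(n, 1, plogis(-1 + 0.01 * sim$lead_time))

# Hold out 20% of rows for testing
train_rows <- sample(n, size = 0.8 * n)
train <- sim[train_rows, , drop = FALSE]
test <- sim[-train_rows, , drop = FALSE]

# Fit on the training rows only, then predict the held-out rows
fit <- lm(is_canceled ~ lead_time, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on data the model has not seen
rmse <- sqrt(mean((test$is_canceled - pred)^2))
print(rmse)
```

Comparing the held-out error across candidate models is one way to choose among them.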

Hotel Cancellation Risk Analysis - Midterm

Olivia Anger & Paul Messerly

10/27/2020