Hotels often overbook on the assumption that a certain percentage of guests will cancel their reservations. Overbooking protects hotels from losses due to operating below capacity; however, it also creates additional risk. If there are fewer cancellations than forecasted, hotels will be unable to provide rooms for some guests, which can result in the loss of loyal customers and damage to the brand. Determining the optimal rate of overbooking serves the best interests of both hotel firms and consumers by enabling firms to maximize profits while adequately satisfying consumer needs. A major hotel corporation has requested our services to determine the optimal rate of overbooking in order to minimize risk and maximize profits.
To determine the optimal overbooking rate for our client, we will analyze comprehensive hotel booking data collected from 2015 to 2017. We will examine numerous variables to determine their relationship with the cancellation rate variable.
We will use several analytic techniques to make sense of our data: exploratory data analysis, analysis of variance (ANOVA), and linear regression.
Gaining insight into the relationship between different variables and the cancellation rate will enable us to advise our client on the factors that must be considered when determining the ideal overbooking rate. Overall, this will enable our client to set the overbooking rate that maximizes profits and minimizes risk.
Our data was obtained from Antonio, Almeida, and Nunes (2019).
# Set the working directory and load the combined hotel bookings data.
setwd("C:/Users/anger/OneDrive - University of Cincinnati/BANA 7025/BANA 7025")
data.df <- read.csv("C:/Users/anger/OneDrive - University of Cincinnati/BANA 7025/hotels.csv", stringsAsFactors = FALSE)
The data was originally collected to support research and education in fields such as revenue management, machine learning, and data mining. It is a combination of two data sets, one on a resort hotel and the other on a city hotel. In total the data has 31 variables and 119,390 observations, where each observation represents a hotel booking. The data was collected from July 2015 to August 2017. The data appears to be relatively clean. However, different variables use inconsistent conventions to mark missing values: NA, Undefined, and NULL all appear in the data. We will use data cleaning techniques to mark missing values uniformly and make a strategic decision on whether to omit, change, or keep them.
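Before recoding, it can help to see how many missing-value markers each column holds. The following is a minimal sketch on a small made-up data frame, not the actual bookings data; the helper name `count_missing_markers` and the toy values are assumptions introduced for illustration.

```r
# Toy data frame standing in for a few columns of the bookings data
# (all values are made up for illustration).
toy.df <- data.frame(
  children = c(0, NA, 2, 1),
  meal     = c("BB", "Undefined", "SC", "HB"),
  agent    = c("9", "NULL", "NULL", "240"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per column, count true NA values plus the
# string markers that also denote missing data.
count_missing_markers <- function(df, markers = c("NA", "NULL", "Undefined")) {
  sapply(df, function(col) sum(is.na(col) | col %in% markers))
}

missing.counts <- count_missing_markers(toy.df)
print(missing.counts)
```

Applied to `data.df`, the same helper would show which columns need recoding and how many bookings each decision affects.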
We will clean the data to create consistency and ensure it can be thoroughly and efficiently analyzed.
To clean the data we will:

* Ensure variable names follow a proper, uniform naming convention
    + Snake case was selected as the designated naming convention
    + Source data was received in the selected naming convention
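# Treat the "Undefined" and "SC" meal codes as no meal package.
data.df$meal <- replace(data.df$meal, data.df$meal == "Undefined", "none")
data.df$meal <- replace(data.df$meal, data.df$meal == "SC", "none")
# read.csv() loads missing children counts as true NA values, so test with is.na() rather than == "NA".
data.df$children <- replace(data.df$children, is.na(data.df$children), 0)
# Drop columns 31 and 32 by position.
data.df <- data.df[-c(31, 32)]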
The category “Undefined” in market_segment and distribution_channel was changed to “none”.
distribution_channel was then removed so that its strong association with market_segment does not distort our analysis.
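The recoding described here can be sketched as follows. This is an illustration on a made-up character vector, not the code that was run on the bookings data; the helper name `recode_undefined` is an assumption.

```r
# Made-up market segment values for illustration.
segments <- c("Online TA", "Undefined", "Direct", "Undefined")

# Hypothetical helper: replace the "Undefined" category with "none".
recode_undefined <- function(x) replace(x, x == "Undefined", "none")

segments.clean <- recode_undefined(segments)
print(segments.clean)
```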
data.df <- data.df[-c(16)]
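One way to check how strongly two categorical variables such as market_segment and distribution_channel are associated is a cross-tabulation with Cramér's V. The sketch below uses made-up values in which the two variables are perfectly associated, so V equals 1; the variable names `segment` and `channel` are assumptions, and the check would need to be run on the data before the column is dropped.

```r
# Made-up values for illustration: each segment maps to exactly one channel.
segment <- c("Direct", "Direct", "Online TA", "Online TA", "Corporate", "Corporate")
channel <- c("Direct", "Direct", "TA/TO", "TA/TO", "Corporate", "Corporate")

# Cross-tabulate the two categorical variables.
tab <- table(segment, channel)

# Chi-squared statistic; the small-sample warning is suppressed because
# only the statistic is needed here, not the p-value.
chi.sq <- suppressWarnings(chisq.test(tab)$statistic)

# Cramér's V: 0 means no association, 1 means perfect association.
cramers.v <- sqrt(as.numeric(chi.sq) / (sum(tab) * (min(dim(tab)) - 1)))
print(cramers.v)
```

A value close to 1 on the real data would support keeping only one of the two variables.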
# Recode the "NULL" marker in agent and company as "none".
data.df$agent <- replace(data.df$agent, data.df$agent == "NULL", "none")
data.df$company <- replace(data.df$company, data.df$company == "NULL", "none")
# Work on a copy and convert the categorical columns to factors.
data.df.one <- data.df
factor.cols <- c("hotel", "arrival_date_year", "arrival_date_month", "meal", "country",
                 "market_segment", "reserved_room_type", "assigned_room_type", "booking_changes",
                 "deposit_type", "agent", "company", "customer_type")
data.df.one[factor.cols] <- lapply(data.df[factor.cols], as.factor)
We will discover information in the data that is not self-evident by creating a model for the explanatory relationships between different variables and cancellations.
We will provide histograms, boxplots, an ANOVA table for our numeric variables, a simple linear regression model with a line of best fit for each of the numeric variables, and a summary statistics table.
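As a rough sketch of what these outputs could look like for one numeric variable, the example below fits a simple linear regression and prints its ANOVA table. It runs on simulated data, not the bookings data; the column names `lead_time` and `is_canceled` and the simulated relationship between them are assumptions made purely for illustration.

```r
# Simulated stand-in data (not real bookings): cancellation probability
# is assumed to rise with lead time so that the model has something to fit.
set.seed(1)
n <- 200
sim.df <- data.frame(lead_time = round(runif(n, 0, 300)))
sim.df$is_canceled <- rbinom(n, 1, prob = 0.1 + 0.002 * sim.df$lead_time)

# Summary statistics table for the simulated variables.
print(summary(sim.df))

# Simple linear regression of the cancellation indicator on lead time;
# the coefficients are the intercept and slope of the line of best fit.
fit <- lm(is_canceled ~ lead_time, data = sim.df)
print(coef(fit))

# ANOVA table for the fitted model.
anova.table <- anova(fit)
print(anova.table)
```

Repeating the same three steps for each numeric variable in `data.df.one` would produce the tables described above.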
We will need to learn more about modeling, including how to create models, how to choose the best model for our data, and which packages are useful for creating models. We also need to learn how to
We will incorporate machine learning techniques by applying linear regression.